Test environment: Hive on Spark
Spark version: 3.3.1

  • I. Execution timing
  • II. Effect on the result set
  • III. Efficiency comparison
    • 1. Inner join
      • 1) on
      • 2) where
    • 2. Outer join
      • 1) on
      • 2) where
  • IV. Summary
  • PS

I. Execution timing

In a SQL join, where is a filter condition: it is applied to the result set produced by the join, so its theoretical execution point is after the join. on is a join condition: it determines which rows are allowed to match each other, so its theoretical execution point is during the join.

In practice, however, most database systems apply optimizations. The common idea is to move the filter conditions in where, and the join conditions in on, as close to the data source as possible, so that less data participates in the join. As a result, the actual execution timing usually differs from the theoretical one.
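The physical plans shown below can be reproduced by prefixing a query with EXPLAIN FORMATTED (standard Spark SQL 3.x syntax, which produces the numbered-operator output used throughout this post). A minimal sketch:

EXPLAIN FORMATTED
select t1.bid, t2.pv
from poi_data.poi_res t1
join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid = t2.bid
where t2.pv > 100;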

II. Effect on the result set

In an inner join, the result set is the same whether a condition is placed in where or in on.

In an outer join (take the left outer join as an example), all records of the left table are preserved. Conditions in on take effect at join time, so the final result still keeps every left-table record, with NULLs filled in for unmatched right-table columns. where, by contrast, operates on the result set after the join, so it filters rows out and the two result sets differ.
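A toy example makes the difference concrete (a and b are hypothetical tables, not part of this test):

-- assume a(id) holds {1, 2} and b(id, v) holds {(1, 10)}
select a.id, b.v from a left join b on a.id = b.id and b.v > 5;
-- condition in on: both rows of a survive -> (1, 10) and (2, NULL)
select a.id, b.v from a left join b on a.id = b.id where b.v > 5;
-- condition in where: the NULL row is dropped -> only (1, 10)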

III. Efficiency comparison

Test data volumes:

poi_data.poi_res: 83M+ rows.
bi_report.mon_ronghe_pv: a partitioned table with 12B+ rows in total; this test joins against partition 20240522, which holds 59M+ rows, of which 1.2M+ satisfy bid like '1%' and pv > 100.

The join key has no duplicate values in either table (a quick uniqueness check is sketched below).
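A way to check key uniqueness, shown only as a sketch:

select bid
from poi_data.poi_res
group by bid
having count(*) > 1;
-- an empty result means bid is unique in poi_res; the analogous query,
-- restricted to event_day='20240522', applies to bi_report.mon_ronghe_pv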

1. Inner join

1) on

select
    t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from poi_data.poi_res t1
join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid = t2.bid
and t2.bid like '1%' and t2.pv > 100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)

(1) Scan hive poi_data.poi_res
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297, name#299, point_x#316, point_y#317], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#297, type#298, name#299, address#300, phone#301, alias#302, post_code#303, catalog_id#304, c..., Partition Cols: []]

(2) Filter [codegen id : 1]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))

(3) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=774]

(4) ShuffleQueryStage
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#334, pv#335, event_day#338]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#338), (event_day#338 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]

(9) Filter [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338]

(11) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id=799]

(12) ShuffleQueryStage
Output [2]: [bid#334, pv#335]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#334, pv#335]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335]

(17) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000

(18) Filter
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))

(19) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id=759]

(20) Sort
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 > 100))

(22) Project
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338]

(23) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id=760]

(24) Sort
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None

(26) Project
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335]

(27) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: isFinalPlan=true

The physical plan shows that in step (2) a Filter with Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)) filters t1 as its source data is read, and in step (7) predicate pushdown filters t2 at scan time with PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]. Both tables are filtered on the data-source side, which reduces both the shuffle volume and the amount of data participating in the join.
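In effect the optimizer behaves as if both filters had been written into the subqueries by hand. A sketch of the logically equivalent query (an illustration of the rewrite, not a query from the original test):

select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from (select bid, name, point_x, point_y
      from poi_data.poi_res
      where bid like '1%') t1
join (select bid, pv
      from bi_report.mon_ronghe_pv
      where event_day='20240522' and bid like '1%' and pv > 100) t2
on t1.bid = t2.bid;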

2) where

select
    t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from poi_data.poi_res t1
join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid = t2.bid
where t2.bid like '1%' and t2.pv > 100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)

(1) Scan hive poi_data.poi_res
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350, name#352, point_x#369, point_y#370], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#350, type#351, name#352, address#353, phone#354, alias#355, post_code#356, catalog_id#357, c..., Partition Cols: []]

(2) Filter [codegen id : 1]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))

(3) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=908]

(4) ShuffleQueryStage
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#387, pv#388, event_day#391]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#391), (event_day#391 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]

(9) Filter [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391]

(11) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id=933]

(12) ShuffleQueryStage
Output [2]: [bid#387, pv#388]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#387, pv#388]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388]

(17) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000

(18) Filter
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))

(19) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id=893]

(20) Sort
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 > 100))

(22) Project
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391]

(23) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id=894]

(24) Sort
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None

(26) Project
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388]

(27) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: isFinalPlan=true

The physical plan is unchanged. So when the engine supports predicate pushdown, it makes no difference whether the filter conditions of an inner join are written in where or in on: either way the data is filtered on the data-source side, reducing the amount of data that participates in the join.

2. Outer join

1) on

select
    t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from poi_data.poi_res t1
left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid = t2.bid
and t2.bid like '1%' and t2.pv > 100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin LeftOuter (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=36.5 MiB, rowCount=3.07E+5)
         :        +- Exchange (3)
         :           +- * LocalLimit (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin LeftOuter (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- LocalLimit (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)

(1) Scan hive poi_data.poi_res
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403, name#405, point_x#422, point_y#423], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#403, type#404, name#405, address#406, phone#407, alias#408, post_code#409, catalog_id#410, c..., Partition Cols: []]

(2) LocalLimit [codegen id : 1]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000

(3) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id=1043]

(4) ShuffleQueryStage
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#440, pv#441, event_day#444]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#444), (event_day#444 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]

(9) Filter [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444]

(11) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id=1067]

(12) ShuffleQueryStage
Output [2]: [bid#440, pv#441]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#440, pv#441]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441]

(17) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000

(18) LocalLimit
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000

(19) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id=1029]

(20) Sort
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 > 100))

(22) Project
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444]

(23) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id=1030]

(24) Sort
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None

(26) Project
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441]

(27) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: isFinalPlan=true

Because this is a left join, the conditions in on are join conditions and the result must keep every record of the left table, so no filter is applied to t1 and it is read in full (the LocalLimit (2) in the plan appears to come from the client's implicit LIMIT 1000, which can safely be pushed to the preserved side of a left outer join), while t2 is still filtered via predicate pushdown.
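Semantically, on-conditions that reference only the right table of a left join amount to pre-filtering the right side, which is why the pushdown is safe there. A sketch of the equivalent query:

select t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from poi_data.poi_res t1
left join (select bid, pv
           from bi_report.mon_ronghe_pv
           where event_day='20240522' and bid like '1%' and pv > 100) t2
on t1.bid = t2.bid;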

2) where

select
    t1.bid, t1.name, t1.point_x, t1.point_y, t2.pv
from poi_data.poi_res t1
left join (select bid, pv from bi_report.mon_ronghe_pv where event_day='20240522') t2
on t1.bid = t2.bid
where t2.bid like '1%' and t2.pv > 100;
== Physical Plan ==
AdaptiveSparkPlan (28)
+- == Final Plan ==
   CollectLimit (17)
   +- * Project (16)
      +- * SortMergeJoin Inner (15)
         :- * Sort (6)
         :  +- AQEShuffleRead (5)
         :     +- ShuffleQueryStage (4), Statistics(sizeInBytes=5.3 GiB, rowCount=4.57E+7)
         :        +- Exchange (3)
         :           +- * Filter (2)
         :              +- Scan hive poi_data.poi_res (1)
         +- * Sort (14)
            +- AQEShuffleRead (13)
               +- ShuffleQueryStage (12), Statistics(sizeInBytes=58.5 MiB, rowCount=1.28E+6)
                  +- Exchange (11)
                     +- * Project (10)
                        +- * Filter (9)
                           +- * ColumnarToRow (8)
                              +- Scan parquet bi_report.mon_ronghe_pv (7)
+- == Initial Plan ==
   CollectLimit (27)
   +- Project (26)
      +- SortMergeJoin Inner (25)
         :- Sort (20)
         :  +- Exchange (19)
         :     +- Filter (18)
         :        +- Scan hive poi_data.poi_res (1)
         +- Sort (24)
            +- Exchange (23)
               +- Project (22)
                  +- Filter (21)
                     +- Scan parquet bi_report.mon_ronghe_pv (7)

(1) Scan hive poi_data.poi_res
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456, name#458, point_x#475, point_y#476], HiveTableRelation [`poi_data`.`poi_res`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#456, type#457, name#458, address#459, phone#460, alias#461, post_code#462, catalog_id#463, c..., Partition Cols: []]

(2) Filter [codegen id : 1]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))

(3) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id=1176]

(4) ShuffleQueryStage
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: 0

(5) AQEShuffleRead
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: coalesced

(6) Sort [codegen id : 3]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0

(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#493, pv#494, event_day#497]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day=20240522]
PartitionFilters: [isnotnull(event_day#497), (event_day#497 = 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: struct<bid:string,pv:int>

(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]

(9) Filter [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 > 100))

(10) Project [codegen id : 2]
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497]

(11) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id=1201]

(12) ShuffleQueryStage
Output [2]: [bid#493, pv#494]
Arguments: 1

(13) AQEShuffleRead
Input [2]: [bid#493, pv#494]
Arguments: coalesced

(14) Sort [codegen id : 4]
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0

(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None

(16) Project [codegen id : 5]
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494]

(17) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000

(18) Filter
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))

(19) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id=1161]

(20) Sort
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0

(21) Filter
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 > 100))

(22) Project
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497]

(23) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id=1162]

(24) Sort
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0

(25) SortMergeJoin
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None

(26) Project
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494]

(27) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000

(28) AdaptiveSparkPlan
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: isFinalPlan=true

where is a filter condition and shapes the final result of the left join. Because t2.bid like '1%' and t2.pv > 100 reject NULLs, the optimizer rewrites the left outer join as an inner join (the plan now shows SortMergeJoin Inner), and through the equi-join condition the bid predicate is also propagated to the left side: in step (2) of the plan, t1 is filtered by bid before the join, just as in the inner-join case.

IV. Summary

Assuming the database system supports predicate pushdown:

  • Inner join: the two inner-join plans are identical; both filter t2 with PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)] and t1 with Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)). For an inner join, then, where and on are equally efficient.
  • Outer join: taking the left outer join again, conditions on the right table are pushed down in either case; whether the left table is filtered early depends on where versus on, and on whether the predicates constrain the left table. 1) With on, the left table must be read in full, so the efficiency gap mainly depends on the left table's size. 2) With where, if the predicates constrain the left table (here, indirectly through the equi-join key), its data is filtered early; otherwise the left table is still read in full. The two patterns are sketched after this list.
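The two left-join patterns side by side (a sketch; the behavior described was verified here only on Spark 3.3.1 with these tables):

-- keep every poi_res row and filter only the right side: condition in on
select t1.bid, t2.pv
from poi_data.poi_res t1
left join bi_report.mon_ronghe_pv t2
on t1.bid = t2.bid and t2.event_day = '20240522' and t2.pv > 100;

-- keep only matched rows that pass the filter: condition in where
-- (the null-rejecting predicate lets the optimizer turn this into an inner join)
select t1.bid, t2.pv
from poi_data.poi_res t1
left join bi_report.mon_ronghe_pv t2
on t1.bid = t2.bid and t2.event_day = '20240522'
where t2.pv > 100;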

PS

In the inner-join physical plans, the filter on poi_res appears as a standalone Filter step (2) with Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)), while the filter on mon_ronghe_pv lives inside the scan in step (7) as PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]. What is the difference? From the material I found, PushedFilters can be understood as filtering at read time: rows that fail the condition are simply never read. A Filter first reads the data and only then decides, row by row, whether it participates in the downstream computation.

Since both happen on the data-source side, my initial question was why the Filter cannot behave like PushedFilters and discard rows while reading, which would also cut the input volume and improve efficiency. From the material I found, whether data can be filtered inside the scan depends mainly on the data source. Big-data storage formats are broadly row-oriented or column-oriented, and columnar formats, thanks to their layout and rich metadata, support predicate pushdown much better. In this test, mon_ronghe_pv is stored as parquet, while poi_res is a text table.
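If scan-time filtering on poi_res mattered, one option would be to keep a copy of it in a columnar format so that simple predicates can surface as PushedFilters. A sketch (poi_res_parquet is a hypothetical table name, not part of this test):

create table poi_data.poi_res_parquet stored as parquet as
select * from poi_data.poi_res;
-- a later scan of poi_res_parquet with a predicate such as bid like '1%'
-- can then be evaluated inside the parquet reader rather than in a separate Filter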
