Sparksql filtering (selecting with where clause) with multiple conditions


Question

Hi, I have the following issue. I registered a temp table:

numeric.registerTempTable("numeric")

All the values that I want to filter on are literal null strings and not N/A or Null values.

I tried these three options:


1. numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')

2. numeric_filtered = numeric.filter(numeric['LOW'] != 'null' AND numeric['HIGH'] != 'null' AND numeric['NORMAL'] != 'null')

3. sqlContext.sql("SELECT * FROM numeric WHERE LOW != 'null' AND HIGH != 'null' AND NORMAL != 'null'")

Unfortunately, numeric_filtered is always empty. I checked, and numeric has data that should be filtered based on these conditions.

Here are some sample values:

LOW   HIGH  NORMAL
3.5   5.0   null
2.0   14.0  null
null  38.0  null
null  null  null
1.0   null  4.0

Answer

You are using a logical conjunction (AND). It means that all columns have to be different from 'null' for the row to be included. Let's illustrate that using the filter version as an example:

numeric = sqlContext.createDataFrame([
    ('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'),  ('null', '38.0', 'null'),
    ('null', 'null', 'null'),  ('1.0', 'null', '4.0')],
    ('low', 'high', 'normal'))

numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+

numeric_filtered_2 = numeric_filtered_1.where(
    numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null|   4.0|
## +---+----+------+

numeric_filtered_3 = numeric_filtered_2.where(
    numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+
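The same conjunction can be checked without Spark. A minimal plain-Python sketch over the sample rows (an illustration, not part of the original answer) confirms that no row has all three columns different from the string 'null', which is why the chained filters end up empty:

```python
rows = [
    ('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'), ('null', '38.0', 'null'),
    ('null', 'null', 'null'), ('1.0', 'null', '4.0')]

# AND across columns: a row survives only if every value differs from 'null'
kept_and = [r for r in rows if all(v != 'null' for v in r)]
print(kept_and)  # -> [] : no row satisfies all three conditions at once
```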

All remaining methods you've tried follow exactly the same pattern. What you need here is a logical disjunction (OR).

from pyspark.sql.functions import col

numeric_filtered = numeric.where(
    (col('LOW')    != 'null') |
    (col('NORMAL') != 'null') |
    (col('HIGH')   != 'null'))
numeric_filtered.show()
numeric_filtered.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+

Or with raw SQL:

numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
    WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
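If more columns need the same treatment, the disjunction can be built programmatically instead of spelled out column by column. Here is a small plain-Python sketch of the idea (a hypothetical helper, not from the original answer) using functools.reduce; in PySpark the same reduce over `col(c) != 'null'` conditions combined with `|` would yield an equivalent Column expression to pass to `where`:

```python
from functools import reduce
from operator import or_

rows = [
    ('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'), ('null', '38.0', 'null'),
    ('null', 'null', 'null'), ('1.0', 'null', '4.0')]

# OR across all columns: keep a row if at least one value differs from 'null'
def keep(row):
    return reduce(or_, (v != 'null' for v in row))

kept = [r for r in rows if keep(r)]
print(len(kept))  # the all-'null' row is the only one dropped
```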

