DataFrame

Instance Constructors

new DataFrame(sqlContext: SQLContext, logicalPlan: LogicalPlan)

A constructor that automatically analyzes the logical plan.
This reports errors eagerly as the DataFrame is constructed, unless SQLConf.dataFrameEagerAnalysis is turned off.
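A minimal sketch of what eager analysis looks like in practice, assuming a local Spark 1.6 setup with default settings. The DataFrame `people` and the column name `no_such_column` are hypothetical. Under those assumptions, selecting an unknown column is expected to fail as soon as the new DataFrame is constructed, before any action runs.

```scala
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed local setup; getOrCreate reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("example").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

// Hypothetical sample data.
val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// No action (collect, count, ...) is invoked here; a failure comes from plan analysis.
val attempt = Try(people.select("no_such_column"))
println(attempt.isFailure)
```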

Value Members

final def !=(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def !=(arg0: Any): Boolean

Definition Classes
Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def ==(arg0: Any): Boolean

Definition Classes
Any
def agg(expr: Column, exprs: Column*): DataFrame

Aggregates on the entire DataFrame without groups.
```
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg(max($"age"), avg($"salary"))
df.groupBy().agg(max($"age"), avg($"salary"))
```
Annotations
@varargs()
Since
1.3.0
def agg(exprs: Map[String, String]): DataFrame

(Java-specific) Aggregates on the entire DataFrame without groups.
```
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg(Map("age" -> "max", "salary" -> "avg"))
df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
```
Since
1.3.0
def agg(exprs: Map[String, String]): DataFrame

(Scala-specific) Aggregates on the entire DataFrame without groups.
```
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg(Map("age" -> "max", "salary" -> "avg"))
df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
```
Since
1.3.0
def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame

(Scala-specific) Aggregates on the entire DataFrame without groups.
```
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg("age" -> "max", "salary" -> "avg")
df.groupBy().agg("age" -> "max", "salary" -> "avg")
```
Since
1.3.0
def alias(alias: Symbol): DataFrame

(Scala-specific) Returns a new DataFrame with an alias set. Same as as.

Since
1.6.0
def alias(alias: String): DataFrame

Returns a new DataFrame with an alias set. Same as as.

Since
1.6.0
def apply(colName: String): Column

Selects a column based on the column name and returns it as a Column. Note that the column name can also refer to a nested column such as a.b.

Since
1.3.0
def as(alias: Symbol): DataFrame

(Scala-specific) Returns a new DataFrame with an alias set.

Since
1.3.0
def as(alias: String): DataFrame

Returns a new DataFrame with an alias set.
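One common use of an alias is to qualify column names in a self-join. The sketch below assumes a local Spark 1.6 setup; the `employees` data and the aliases `e` and `m` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed local setup; getOrCreate reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("example").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

// Hypothetical sample data: each employee row points at a manager id.
val employees = Seq((1, "ann", 2), (2, "bob", 3), (3, "cy", 3)).toDF("id", "name", "manager_id")

// Alias each side so that columns can be referenced as e.* and m.*.
val e = employees.as("e")
val m = employees.as("m")
val pairs = e.join(m, $"e.manager_id" === $"m.id")
  .select($"e.name".as("employee"), $"m.name".as("manager"))
pairs.show()
```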

Since
1.3.0
def as[U](implicit arg0: Encoder[U]): Dataset[U]

:: Experimental :: Converts this DataFrame to a strongly-typed Dataset containing objects of the specified type, U.

Annotations
@Experimental()
Since
1.6.0
final def asInstanceOf[T0]: T0

Definition Classes
Any
def cache(): DataFrame.this.type

Persist this DataFrame with the default storage level (MEMORY_AND_DISK).

Since
1.3.0
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def coalesce(numPartitions: Int): DataFrame

Returns a new DataFrame that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency. For example, if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions.
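The following sketch illustrates reducing the partition count with coalesce. It assumes a local Spark 1.6 setup; the names `wide` and `narrow` and the partition counts are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed local setup; getOrCreate reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("example").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)

// Start from 8 partitions (repartition shuffles), then merge them down to 2.
val wide = sqlContext.range(0, 1000).repartition(8)
val narrow = wide.coalesce(2)

println(wide.rdd.partitions.length)
println(narrow.rdd.partitions.length)
```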

Since
1.4.0
def col(colName: String): Column

Selects a column based on the column name and returns it as a Column. Note that the column name can also refer to a nested column such as a.b.

Since
1.3.0
def collect(): Array[Row]

Returns an array that contains all Rows in this DataFrame.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
For Java API, use collectAsList.

Since
1.3.0
def collectAsList(): List[Row]

Returns a Java list that contains all Rows in this DataFrame.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

Since
1.3.0
def collectToPython(): Int

Attributes
protected[org.apache.spark.sql]
def columns: Array[String]

Returns all column names as an array.

Since
1.3.0
def count(): Long

Returns the number of rows in the DataFrame.

Since
1.3.0
def cube(col1: String, cols: String*): GroupedData

Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
```
// Compute the average for all numeric columns cubed by department and group.
df.cube("department", "group").avg()

// Compute the max age and average salary, cubed by department and gender.
df.cube("department", "gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
```
Annotations
@varargs()
Since
1.4.0
def cube(cols: Column*): GroupedData

Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
```
// Compute the average for all numeric columns cubed by department and group.
df.cube($"department", $"group").avg()

// Compute the max age and average salary, cubed by department and gender.
df.cube($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
```
Annotations
@varargs()
Since
1.4.0
def describe(cols: String*): DataFrame

Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. If you want to programmatically compute summary statistics, use the agg function instead.
```
df.describe("age", "height").show()

// output:
// summary age   height
// count   10.0  10.0
// mean    53.3  178.05
// stddev  11.6  15.7
// min     18.0  163.0
// max     92.0  192.0
```
Annotations
@varargs()
Since
1.3.1
def distinct(): DataFrame

Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for dropDuplicates.

Since
1.3.0
def drop(col: Column): DataFrame

Returns a new DataFrame with a column dropped. This version of drop accepts a Column rather than a name. This is a no-op if the DataFrame doesn't have a column with an equivalent expression.

Since
1.4.1
def drop(colName: String): DataFrame

Returns a new DataFrame with a column dropped. This is a no-op if the schema doesn't contain the column name.

Since
1.4.0
def dropDuplicates(colNames: Array[String]): DataFrame

Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.

Since
1.4.0
def dropDuplicates(colNames: Seq[String]): DataFrame

(Scala-specific) Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.
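A short sketch of de-duplicating on a subset of columns, assuming a local Spark 1.6 setup; the `visits` data and its column names are hypothetical. This page does not say which of the duplicate rows is kept, so the example only inspects the row count.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed local setup; getOrCreate reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("example").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

// Hypothetical sample data: "alice" appears twice with different pages.
val visits = Seq(("alice", "home"), ("alice", "cart"), ("bob", "home")).toDF("user", "page")

// Rows are considered duplicates when they agree on the "user" column only.
val onePerUser = visits.dropDuplicates(Seq("user"))
println(onePerUser.count())
```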

Since
1.4.0
def dropDuplicates(): DataFrame

Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for distinct.

Since
1.4.0
def dtypes: Array[(String, String)]

Returns all column names and their data types as an array.

Since
1.3.0
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def except(other: DataFrame): DataFrame

Returns a new DataFrame containing rows in this frame but not in another frame. This is equivalent to EXCEPT in SQL.
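To make the set semantics concrete, the sketch below contrasts except with intersect and unionAll on two small ranges. It assumes a local Spark 1.6 setup; the helper `ids` and the ranges are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Assumed local setup; getOrCreate reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("example").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)

val a = sqlContext.range(0, 4) // ids 0, 1, 2, 3
val b = sqlContext.range(2, 6) // ids 2, 3, 4, 5

// Hypothetical helper: collect the single "id" column to the driver and sort it.
def ids(df: DataFrame): Seq[Long] = df.collect().map(_.getLong(0)).sorted.toSeq

val onlyInA = ids(a.except(b))      // rows of a that are not in b
val inBoth = ids(a.intersect(b))    // rows present in both frames
val everything = ids(a.unionAll(b)) // all rows of both frames, duplicates kept
```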

Since
1.3.0
def explain(): Unit

Prints the physical plan to the console for debugging purposes.

Definition Classes
DataFrame → Queryable
Since
1.3.0
def explain(extended: Boolean): Unit

Prints the plans (logical and physical) to the console for debugging purposes.

Definition Classes
DataFrame → Queryable
Since
1.3.0
def explode[A, B](inputColumn: String, outputColumn: String)(f: (A) ⇒ TraversableOnce[B])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[B]): DataFrame

(Scala-specific) Returns a new DataFrame where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.
```
df.explode("words", "word"){words: String => words.split(" ")}
```
Since
1.3.0
def explode[A <: Product](input: Column*)(f: (Row) ⇒ TraversableOnce[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame

(Scala-specific) Returns a new DataFrame where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.
The following example uses this function to count the number of books which contain a given word:
```
case class Book(title: String, words: String)
val df: RDD[Book]

case class Word(word: String)
val allWords = df.explode('words) {
  case Row(words: String) => words.split(" ").map(Word(_))
}

val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
```
Since
1.3.0
def filter(conditionExpr: String): DataFrame

Filters rows using the given SQL expression.
```
peopleDf.filter("age > 15")
```
Since
1.3.0
def filter(condition: Column): DataFrame

Filters rows using the given condition.
```
// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
```
Since
1.3.0
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
def first(): Row

Returns the first row. Alias for head().

Since
1.3.0
def flatMap[R](f: (Row) ⇒ TraversableOnce[R])(implicit arg0: ClassTag[R]): RDD[R]

Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results.

Since
1.3.0
def foreach(f: (Row) ⇒ Unit): Unit

Applies a function f to all rows.

Since
1.3.0
def foreachPartition(f: (Iterator[Row]) ⇒ Unit): Unit

Applies a function f to each partition of this DataFrame.

Since
1.3.0
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def groupBy(col1: String, cols: String*): GroupedData

Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
```
// Compute the average for all numeric columns grouped by department.
df.groupBy("department").avg()

// Compute the max age and average salary, grouped by department and gender.
df.groupBy("department", "gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
```
Annotations
@varargs()
Since
1.3.0
def groupBy(cols: Column*): GroupedData

Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
```
// Compute the average for all numeric columns grouped by department.
df.groupBy($"department").avg()

// Compute the max age and average salary, grouped by department and gender.
df.groupBy($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
```
Annotations
@varargs()
Since
1.3.0
def hashCode(): Int

Definition Classes
AnyRef → Any
def head(): Row

Returns the first row.

Since
1.3.0
def head(n: Int): Array[Row]

Returns the first n rows.

Since
1.3.0
def inputFiles: Array[String]

Returns a best-effort snapshot of the files that compose this DataFrame. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.
def intersect(other: DataFrame): DataFrame

Returns a new DataFrame containing rows only in both this frame and another frame. This is equivalent to INTERSECT in SQL.

Since
1.3.0
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def isLocal: Boolean

Returns true if the collect and take methods can be run locally (without any Spark executors).

Since
1.3.0
def javaRDD: JavaRDD[Row]

Returns the content of the DataFrame as a JavaRDD of Rows.

Since
1.3.0
def javaToPython: JavaRDD[Array[Byte]]

Converts a JavaRDD to a PythonRDD.

Attributes
protected[org.apache.spark.sql]
def join(right: DataFrame, joinExprs: Column, joinType: String): DataFrame

Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.
```
// Scala:
import org.apache.spark.sql.functions._
df1.join(df2, $"df1Key" === $"df2Key", "outer")

// Java:
import static org.apache.spark.sql.functions.*;
df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
```
right
Right side of the join.
joinExprs
Join expression.
joinType
One of: inner, outer, left_outer, right_outer, leftsemi.

Since
1.3.0
def join(right: DataFrame, joinExprs: Column): DataFrame

Inner join with another DataFrame, using the given join expression.
```
// The following two are equivalent:
df1.join(df2, $"df1Key" === $"df2Key")
df1.join(df2).where($"df1Key" === $"df2Key")
```
Since
1.3.0
def join(right: DataFrame, usingColumns: Seq[String], joinType: String): DataFrame

Equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
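A hedged sketch of this overload with an explicit join type, assuming a local Spark 1.6 setup; the `users` and `scores` frames are hypothetical. It checks that the join column shows up only once in the result.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed local setup; getOrCreate reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("example").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

// Hypothetical sample data: user 2 has no matching score.
val users = Seq((1, "alice"), (2, "bob")).toDF("user_id", "name")
val scores = Seq((1, 90)).toDF("user_id", "score")

// Left outer equi-join on "user_id"; unmatched users keep a null score.
val joined = users.join(scores, Seq("user_id"), "left_outer")
joined.printSchema()
joined.show()
```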
right
Right side of the join operation.
usingColumns
Names of the columns to join on. These columns must exist on both sides.
joinType
One of: inner, outer, left_outer, right_outer, leftsemi.

Since
1.6.0
def join(right: DataFrame, usingColumns: Seq[String]): DataFrame

Inner equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
```
// Joining df1 and df2 using the columns "user_id" and "user_name"
df1.join(df2, Seq("user_id", "user_name"))
```
Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
right
Right side of the join operation.
usingColumns
Names of the columns to join on. These columns must exist on both sides.

Since
1.4.0
def join(right: DataFrame, usingColumn: String): DataFrame

Inner equi-join with another DataFrame using the given column.
Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
```
// Joining df1 and df2 using the column "user_id"
df1.join(df2, "user_id")
```
Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
right
Right side of the join operation.
usingColumn
Name of the column to join on. This column must exist on both sides.

Since
1.4.0
def join(right: DataFrame): DataFrame

Cartesian join with another DataFrame.
Note that cartesian joins are very expensive without an extra filter that can be pushed down.
right
Right side of the join operation.

Since
1.3.0
def limit(n: Int): DataFrame

Returns a new DataFrame by taking the first n rows. The difference between this function and head is that head returns an array while limit returns a new DataFrame.

Since
1.3.0
val logicalPlan: LogicalPlan

Attributes
protected[org.apache.spark.sql]
def map[R](f: (Row) ⇒ R)(implicit arg0: ClassTag[R]): RDD[R]

Returns a new RDD by applying a function to all rows of this DataFrame.

Since
1.3.0
def mapPartitions[R](f: (Iterator[Row]) ⇒ Iterator[R])(implicit arg0: ClassTag[R]): RDD[R]

Returns a new RDD by applying a function to each partition of this DataFrame.

Since
1.3.0
def na: DataFrameNaFunctions

Returns a DataFrameNaFunctions for working with missing data.
```
// Dropping rows containing any null values.
df.na.drop()
```
Since
1.3.1
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def numericColumns: Seq[Expression]

Attributes
protected[org.apache.spark.sql]
def orderBy(sortExprs: Column*): DataFrame

Returns a new DataFrame sorted by the given expressions. This is an alias of the sort function.

Annotations
@varargs()
Since
1.3.0
def orderBy(sortCol: String, sortCols: String*): DataFrame

Returns a new DataFrame sorted by the given expressions. This is an alias of the sort function.

Annotations
@varargs()
Since
1.3.0
def persist(newLevel: StorageLevel): DataFrame.this.type

Persist this DataFrame with the given storage level.
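A minimal sketch of persisting with an explicit storage level and releasing it afterwards, assuming a local Spark 1.6 setup. The DataFrame `df` and the choice of MEMORY_ONLY are for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

// Assumed local setup; getOrCreate reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("example").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)

val df = sqlContext.range(0, 1000)

// Request in-memory storage; data is materialized by the first action.
df.persist(StorageLevel.MEMORY_ONLY)
val first = df.count()  // first action on the persisted DataFrame
val second = df.count() // later actions can reuse the stored data

// Release the storage once the DataFrame is no longer needed.
df.unpersist()
```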
newLevel
One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

Since
1.3.0
def persist(): DataFrame.this.type

Persist this DataFrame with the default storage level (MEMORY_AND_DISK).

Since
1.3.0
def printSchema(): Unit

Prints the schema to the console in a nice tree format.

Definition Classes
DataFrame → Queryable
Since
1.3.0
val queryExecution: QueryExecution

Definition Classes
DataFrame → Queryable
def randomSplit(weights: Array[Double]): Array[DataFrame]

Randomly splits this DataFrame with the provided weights.
weights
Weights for the splits; they will be normalized if they don't sum to 1.

Since
1.4.0
def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]

Randomly splits this DataFrame with the provided weights.
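As an illustration, the sketch below splits a DataFrame into two parts with a fixed seed, assuming a local Spark 1.6 setup. The names `data`, `train` and `test`, the weights and the seed value are hypothetical. The sizes of the parts are only approximately proportional to the weights.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed local setup; getOrCreate reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("example").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)

val data = sqlContext.range(0, 1000)

// Roughly 80% / 20%; the fixed seed makes the split repeatable on the same input.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), 42L)
println(s"train=${train.count()} test=${test.count()}")
```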
weights
Weights for the splits; they will be normalized if they don't sum to 1.
seed
Seed for sampling.

Since
1.4.0
lazy val rdd: RDD[Row]

Represents the content of the DataFrame as an RDD of Rows. Note that the RDD is memoized. Once called, it won't change even if you change any query planning related Spark SQL configurations (e.g. spark.sql.shuffle.partitions).

Since
1.3.0
def registerTempTable(tableName: String): Unit

Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SQLContext that was used to create this DataFrame.
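Once registered, the table can be queried through the same SQLContext. The sketch below assumes a local Spark 1.6 setup; the table name `people` and the sample rows are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed local setup; getOrCreate reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("example").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

// Hypothetical sample data.
val people = Seq(("alice", 30), ("bob", 12)).toDF("name", "age")

// Make the DataFrame visible to SQL queries issued on this SQLContext.
people.registerTempTable("people")
val older = sqlContext.sql("SELECT name FROM people WHERE age > 15")
older.show()
```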

Since
1.3.0
def repartition(partitionExprs: Column*): DataFrame

Returns a new DataFrame partitioned by the given partitioning expressions preserving the existing number of partitions. The resulting DataFrame is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

Annotations
@varargs()
Since
1.6.0
def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame

Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions. The resulting DataFrame is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
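A small sketch of hash partitioning by a column into a fixed number of partitions, assuming a local Spark 1.6 setup; the `staff` data and the partition count are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed local setup; getOrCreate reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("example").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

// Hypothetical sample data.
val staff = Seq(("eng", "ann"), ("eng", "bob"), ("ops", "cy")).toDF("department", "name")

// Rows with the same department value hash to the same partition.
val byDept = staff.repartition(4, $"department")
println(byDept.rdd.partitions.length)
```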

Annotations
@varargs()
Since
1.6.0
def repartition(numPartitions: Int): DataFrame

Returns a new DataFrame that has exactly numPartitions partitions.

Since
1.3.0
def resolve(colName: String): NamedExpression

Attributes
protected[org.apache.spark.sql]
def rollup(col1: String, cols: String*): GroupedData

Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
```
// Compute the average for all numeric columns rolled up by department and group.
df.rollup("department", "group").avg()

// Compute the max age and average salary, rolled up by department and gender.
df.rollup("department", "gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
```
Annotations
@varargs()
Since
1.4.0
def rollup(cols: Column*): GroupedData

Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
```
// Compute the average for all numeric columns rolled up by department and group.
df.rollup($"department", $"group").avg()

// Compute the max age and average salary, rolled up by department and gender.
df.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
```
Annotations
@varargs()
Since
1.4.0
def sample(withReplacement: Boolean, fraction: Double): DataFrame

Returns a new DataFrame by sampling a fraction of rows, using a random seed.
withReplacement
Sample with replacement or not.
fraction
Fraction of rows to generate.

Since
1.3.0
def sample(withReplacement: Boolean, fraction: Double, seed: Long): DataFrame

Returns a new DataFrame by sampling a fraction of rows.
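A minimal sketch of seeded sampling without replacement, assuming a local Spark 1.6 setup; the DataFrame `data`, the fraction and the seed are hypothetical. The number of sampled rows is generally close to, but not exactly, fraction times the row count.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed local setup; getOrCreate reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("example").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)

val data = sqlContext.range(0, 10000)

// Keep each row with probability 0.1; a fixed seed makes the sample repeatable on the same input.
val tenPercent = data.sample(withReplacement = false, fraction = 0.1, seed = 7L)
println(tenPercent.count())
```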
withReplacement
Sample with replacement or not.
fraction
Fraction of rows to generate.
seed
Seed for sampling.

Since
1.3.0
def schema: StructType

Returns the schema of this DataFrame.

Definition Classes
DataFrame → Queryable
Since
1.3.0
def select(col: String, cols: String*): DataFrame

Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).
```
// The following two are equivalent:
df.select("colA", "colB")
df.select($"colA", $"colB")
```
Annotations
@varargs()
Since
1.3.0
def select(cols: Column*): DataFrame

Selects a set of column-based expressions.
```
df.select($"colA", $"colB" + 1)
```
Annotations
@varargs()
Since
1.3.0
def selectExpr(exprs: String*): DataFrame

Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.
```
// The following are equivalent:
df.selectExpr("colA", "colB as newName", "abs(colC)")
df.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
```
Annotations
@varargs()
Since
1.3.0
def show(numRows: Int, truncate: Boolean): Unit

Displays the DataFrame in a tabular form. For example:
```
year  month AVG('Adj Close) MAX('Adj Close)
1980  12    0.503218        0.595103
1981  01    0.523289        0.570307
1982  02    0.436504        0.475256
1983  03    0.410516        0.442194
1984  04    0.450090        0.483521
```
numRows
Number of rows to show
truncate
Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be right-aligned.

Since
1.5.0
def show(truncate: Boolean): Unit

Displays the top 20 rows of DataFrame in a tabular form.
truncate
Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be right-aligned.

Since
1.5.0
def show(): Unit

Displays the top 20 rows of DataFrame in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right.

Since
1.3.0

def show(numRows: Int): Unit

Displays the DataFrame in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right. For example:
```
year  month AVG('Adj Close) MAX('Adj Close)
1980  12    0.503218        0.595103
1981  01    0.523289        0.570307
1982  02    0.436504        0.475256
1983  03    0.410516        0.442194
1984  04    0.450090        0.483521
```
numRows
Number of rows to show

Since
1.3.0

def sort(sortExprs: Column*): DataFrame

Returns a new DataFrame sorted by the given expressions. For example:
```
df.sort($"col1", $"col2".desc)
```
Annotations
@varargs()
Since
1.3.0
def sort(sortCol: String, sortCols: String*): DataFrame

Returns a new DataFrame sorted by the specified column, all in ascending order.
```
// The following 3 are equivalent
df.sort("sortcol")
df.sort($"sortcol")
df.sort($"sortcol".asc)
```
Annotations
@varargs()
Since
1.3.0
def sortWithinPartitions(sortExprs: Column*): DataFrame

Returns a new DataFrame with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).

Annotations
@varargs()
Since
1.6.0
def sortWithinPartitions(sortCol: String, sortCols: String*): DataFrame

Returns a new DataFrame with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).

Annotations
@varargs()
Since
1.6.0
val sqlContext: SQLContext

Definition Classes
DataFrame → Queryable
def stat: DataFrameStatFunctions

Returns a DataFrameStatFunctions for working with statistic functions.
```
// Finding frequent items in column with name 'a'.
df.stat.freqItems(Seq("a"))
```
Since
1.4.0
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def take(n: Int): Array[Row]

Returns the first n rows in the DataFrame.
Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

Since
1.3.0
def takeAsList(n: Int): List[Row]

Returns the first n rows in the DataFrame as a list.
Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

Since
1.6.0
def toDF(colNames: String*): DataFrame

Returns a new DataFrame with columns renamed. This can be quite convenient when converting an RDD of tuples into a DataFrame with meaningful names. For example:
```
val rdd: RDD[(Int, String)] = ...
rdd.toDF()  // this implicit conversion creates a DataFrame with column name _1 and _2
rdd.toDF("id", "name")  // this creates a DataFrame with column name "id" and "name"
```
Annotations
@varargs()
Since
1.3.0
def toDF(): DataFrame

Returns the object itself.

Since
1.3.0
def toJSON: RDD[String]

Returns the content of the DataFrame as an RDD of JSON strings.

Since
1.3.0
def toJavaRDD: JavaRDD[Row]

Returns the content of the DataFrame as a JavaRDD of Rows.

Since
1.3.0
def toString(): String

Definition Classes
Queryable → AnyRef → Any
def transform[U](t: (DataFrame) ⇒ DataFrame): DataFrame

Concise syntax for chaining custom transformations.
```
def featurize(ds: DataFrame) = ...

df
  .transform(featurize)
  .transform(...)
```
Since
1.6.0
def unionAll(other: DataFrame): DataFrame

Returns a new DataFrame containing union of rows in this frame and another frame. This is equivalent to UNION ALL in SQL.
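A small sketch of the UNION ALL behaviour, using made-up data and a local master: duplicate rows are kept, and a separate distinct() is needed to remove them. Columns are matched by position rather than by name, so both frames should share the same column order.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unionAllSketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical frames with identical schemas; the row (2, "y") appears in both.
val left = sc.parallelize(Seq((1, "x"), (2, "y"))).toDF("id", "value")
val right = sc.parallelize(Seq((2, "y"), (3, "z"))).toDF("id", "value")

// unionAll keeps duplicates, like UNION ALL in SQL.
val all = left.unionAll(right)

// distinct() afterwards gives the behaviour of a plain SQL UNION.
val deduplicated = all.distinct()
```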

Since
1.3.0
def unpersist(): DataFrame.this.type

Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk.

Since
1.3.0
def unpersist(blocking: Boolean): DataFrame.this.type

Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk.
blocking
Whether to block until all blocks are deleted.
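The sketch below shows one possible cache and unpersist cycle, assuming a local master and a made-up DataFrame. Passing blocking = true makes the call wait until the cached blocks are removed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unpersistSketch"))
val sqlContext = new SQLContext(sc)

// Hypothetical data: the numbers 0 until 1000 in a single column named "id".
val ids = sqlContext.range(0, 1000)

ids.cache()               // marks the DataFrame for caching; nothing is materialized yet
val total = ids.count()   // the first action populates the cache

// Release the cached blocks and wait until they are deleted.
val released = ids.unpersist(blocking = true)
```

The return type is DataFrame.this.type, so the call returns the same DataFrame instance and can be chained.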

Since
1.3.0
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
def where(conditionExpr: String): DataFrame

Filters rows using the given SQL expression.
```
peopleDf.where("age > 15")
```
Since
1.5.0
def where(condition: Column): DataFrame

Filters rows using the given condition. This is an alias for filter.
```
// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
```
Since
1.3.0
def withColumn(colName: String, col: Column): DataFrame

Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
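Both cases, adding a column and replacing one, can be sketched as follows. The data, column names, and local master are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("withColumnSketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical sample data.
val staff = sc.parallelize(Seq(("Ann", 1000.0), ("Bob", 2000.0))).toDF("name", "salary")

// "bonus" does not exist yet, so a third column is appended.
val withBonus = staff.withColumn("bonus", $"salary" * 0.1)

// "salary" already exists, so the column is replaced rather than duplicated.
val raised = staff.withColumn("salary", $"salary" + 100)
```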

Since
1.3.0
def withColumnRenamed(existingName: String, newName: String): DataFrame

Returns a new DataFrame with a column renamed. This is a no-op if schema doesn't contain existingName.
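A short sketch of both outcomes, assuming made-up data and a local master: a rename of an existing column, and the no-op when the name is not in the schema.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("withColumnRenamedSketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical sample data.
val people = sc.parallelize(Seq((1, "Ann"), (2, "Bob"))).toDF("id", "name")

// "name" exists, so it is renamed.
val renamed = people.withColumnRenamed("name", "fullName")

// "missing" is not in the schema, so the columns are left as they were.
val unchanged = people.withColumnRenamed("missing", "other")
```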

Since
1.3.0
def write: DataFrameWriter

:: Experimental :: Interface for saving the content of the DataFrame out into external storage.
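As a hedged sketch of typical use, the example below writes a small DataFrame as Parquet to a temporary directory and reads it back. The data, the output location, and the local master are assumptions made for the example.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("writeSketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical sample data and a throwaway output location.
val people = sc.parallelize(Seq((1, "Ann"), (2, "Bob"))).toDF("id", "name")
val outPath = Files.createTempDirectory("df-write-sketch").resolve("people").toString

// The writer is configured step by step; nothing is written until parquet() is called.
people.write.mode(SaveMode.Overwrite).parquet(outPath)

// Reading the files back yields a DataFrame with the same rows.
val reloaded = sqlContext.read.parquet(outPath)
```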

Annotations
@Experimental()
Since
1.4.0

Deprecated Value Members

def createJDBCTable(url: String, table: String, allowExisting: Boolean): Unit

Saves this DataFrame to a JDBC database at url under the table name table. This runs a CREATE TABLE statement followed by a series of INSERT INTO statements. If allowExisting is true, any existing table with the given name is dropped first; if it is false, an exception is thrown when the table already exists.

Annotations
@deprecated
Deprecated
(Since version 1.4.0) Use write.jdbc(). This will be removed in Spark 2.0.
def insertInto(tableName: String): Unit

Adds the rows from this DataFrame to the specified table. The rows are appended to the existing data; the table must already exist.

Annotations
@deprecated
Deprecated
(Since version 1.4.0)
def insertInto(tableName: String, overwrite: Boolean): Unit

Adds the rows from this DataFrame to the specified table, optionally overwriting the existing data.

Annotations
@deprecated
Deprecated
(Since version 1.4.0)
def insertIntoJDBC(url: String, table: String, overwrite: Boolean): Unit

Saves this DataFrame to a JDBC database at url under the table name table. If overwrite is true, the table is truncated (TRUNCATE) before the INSERTs are performed.
The table must already exist in the database, with a schema that is compatible with the schema of this DataFrame; inserting the rows of the DataFrame in order via the simple statement INSERT INTO table VALUES (?, ?, ..., ?) should not fail.

Annotations
@deprecated
Deprecated
(Since version 1.4.0) Use write.jdbc(). This will be removed in Spark 2.0.
def save(source: String, mode: SaveMode, options: Map[String, String]): Unit

(Scala-specific) Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.

Annotations
@deprecated
Deprecated
(Since version 1.4.0)
def save(source: String, mode: SaveMode, options: Map[String, String]): Unit

Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.

Annotations
@deprecated
Deprecated
(Since version 1.4.0)
def save(path: String, source: String, mode: SaveMode): Unit

Saves the contents of this DataFrame to the given path based on the given data source and SaveMode specified by mode.

Annotations
@deprecated
Deprecated
(Since version 1.4.0)
def save(path: String, source: String): Unit

Saves the contents of this DataFrame to the given path based on the given data source, using SaveMode.ErrorIfExists as the save mode.

Annotations
@deprecated
Deprecated
(Since version 1.4.0) Use write.format(source).save(path). This will be removed in Spark 2.0.
def save(path: String, mode: SaveMode): Unit

Saves the contents of this DataFrame to the given path and SaveMode specified by mode, using the default data source configured by spark.sql.sources.default.

Annotations
@deprecated
Deprecated
(Since version 1.4.0) Use write.mode(mode).save(path). This will be removed in Spark 2.0.
def save(path: String): Unit

Saves the contents of this DataFrame to the given path, using the default data source configured by spark.sql.sources.default and SaveMode.ErrorIfExists as the save mode.

Annotations
@deprecated
Deprecated
(Since version 1.4.0) Use write.save(path). This will be removed in Spark 2.0.
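The deprecated save overloads map onto DataFrameWriter calls. The sketch below follows the pattern of the replacements named in the deprecation notes; the form shown for save(path, source, mode) is inferred by analogy rather than quoted from a note. The data, the "json" source, the temporary paths, and the local master are assumptions made for the example.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("saveMigrationSketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical sample data and throwaway output locations.
val people = sc.parallelize(Seq((1, "Ann"), (2, "Bob"))).toDF("id", "name")
val baseDir = Files.createTempDirectory("df-save-sketch")
val jsonPath = baseDir.resolve("people-json").toString
val defaultPath = baseDir.resolve("people-default").toString

// Instead of the deprecated people.save(jsonPath, "json", SaveMode.Overwrite):
people.write.format("json").mode(SaveMode.Overwrite).save(jsonPath)

// Instead of the deprecated people.save(defaultPath, SaveMode.Overwrite);
// the source comes from spark.sql.sources.default.
people.write.mode(SaveMode.Overwrite).save(defaultPath)

// Read both outputs back to confirm the rows were written.
val fromJson = sqlContext.read.format("json").load(jsonPath)
val fromDefault = sqlContext.read.load(defaultPath)
```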
def saveAsParquetFile(path: String): Unit

Saves the contents of this DataFrame as a parquet file, preserving the schema. Files that are written out using this method can be read back in as a DataFrame using the parquetFile function in SQLContext.

Annotations
@deprecated
Deprecated
(Since version 1.4.0) Use write.parquet(path). This will be removed in Spark 2.0.
def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit

(Scala-specific) Creates a table from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.
Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

Annotations
@deprecated
Deprecated
(Since version 1.4.0)
def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit

Creates a table at the given path from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.
Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

Annotations
@deprecated
Deprecated
(Since version 1.4.0)
def saveAsTable(tableName: String, source: String, mode: SaveMode): Unit

:: Experimental :: Creates a table at the given path from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.
Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

Annotations
@deprecated
Deprecated
(Since version 1.4.0)
def saveAsTable(tableName: String, source: String): Unit

Creates a table at the given path from the contents of this DataFrame based on a given data source and a set of options, using SaveMode.ErrorIfExists as the save mode.
Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

Annotations
@deprecated
Deprecated
(Since version 1.4.0) Use write.format(source).saveAsTable(tableName). This will be removed in Spark 2.0.
def saveAsTable(tableName: String, mode: SaveMode): Unit

Creates a table from the contents of this DataFrame, using the default data source configured by spark.sql.sources.default and the SaveMode specified by mode.
Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

Annotations
@deprecated
Deprecated
(Since version 1.4.0) Use write.mode(mode).saveAsTable(tableName). This will be removed in Spark 2.0.
def saveAsTable(tableName: String): Unit

Creates a table from the contents of this DataFrame. It will use the default data source configured by spark.sql.sources.default. This will fail if the table already exists.
Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.

Annotations
@deprecated
Deprecated
(Since version 1.4.0) Use write.saveAsTable(tableName). This will be removed in Spark 2.0.
def toSchemaRDD: DataFrame

Annotations
@deprecated
Deprecated
(Since version 1.3.0) Use toDF. This will be removed in Spark 2.0.

class DataFrame extends Queryable with Serializable

Instance Constructors

new DataFrame(sqlContext: SQLContext, logicalPlan: LogicalPlan)

Value Members

final def !=(arg0: AnyRef): Boolean

final def !=(arg0: Any): Boolean

final def ##(): Int

final def ==(arg0: AnyRef): Boolean

final def ==(arg0: Any): Boolean

def agg(expr: Column, exprs: Column*): DataFrame

def agg(exprs: Map[String, String]): DataFrame

def agg(exprs: Map[String, String]): DataFrame

def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame

def alias(alias: Symbol): DataFrame

def alias(alias: String): DataFrame

def apply(colName: String): Column

def as(alias: Symbol): DataFrame

def as(alias: String): DataFrame

def as[U](implicit arg0: Encoder[U]): Dataset[U]

final def asInstanceOf[T0]: T0

def cache(): DataFrame.this.type

def clone(): AnyRef

def coalesce(numPartitions: Int): DataFrame

def col(colName: String): Column

def collect(): Array[Row]

def collectAsList(): List[Row]

def collectToPython(): Int

def columns: Array[String]

def count(): Long

def cube(col1: String, cols: String*): GroupedData

def cube(cols: Column*): GroupedData

def describe(cols: String*): DataFrame

def distinct(): DataFrame

def drop(col: Column): DataFrame

def drop(colName: String): DataFrame

def dropDuplicates(colNames: Array[String]): DataFrame

def dropDuplicates(colNames: Seq[String]): DataFrame

def dropDuplicates(): DataFrame

def dtypes: Array[(String, String)]

final def eq(arg0: AnyRef): Boolean

def equals(arg0: Any): Boolean

def except(other: DataFrame): DataFrame

def explain(): Unit

def explain(extended: Boolean): Unit

def explode[A, B](inputColumn: String, outputColumn: String)(f: (A) ⇒ TraversableOnce[B])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[B]): DataFrame

def explode[A <: Product](input: Column*)(f: (Row) ⇒ TraversableOnce[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame

def filter(conditionExpr: String): DataFrame

def filter(condition: Column): DataFrame

def finalize(): Unit

def first(): Row

def flatMap[R](f: (Row) ⇒ TraversableOnce[R])(implicit arg0: ClassTag[R]): RDD[R]

def foreach(f: (Row) ⇒ Unit): Unit

def foreachPartition(f: (Iterator[Row]) ⇒ Unit): Unit

final def getClass(): Class[_]

def groupBy(col1: String, cols: String*): GroupedData

def groupBy(cols: Column*): GroupedData

def hashCode(): Int

def head(): Row

def head(n: Int): Array[Row]

def inputFiles: Array[String]

def intersect(other: DataFrame): DataFrame

final def isInstanceOf[T0]: Boolean

def isLocal: Boolean

def javaRDD: JavaRDD[Row]

def javaToPython: JavaRDD[Array[Byte]]

def join(right: DataFrame, joinExprs: Column, joinType: String): DataFrame

def join(right: DataFrame, joinExprs: Column): DataFrame

def join(right: DataFrame, usingColumns: Seq[String], joinType: String): DataFrame

def join(right: DataFrame, usingColumns: Seq[String]): DataFrame

def join(right: DataFrame, usingColumn: String): DataFrame

def join(right: DataFrame): DataFrame

def limit(n: Int): DataFrame

val logicalPlan: LogicalPlan

def map[R](f: (Row) ⇒ R)(implicit arg0: ClassTag[R]): RDD[R]

def mapPartitions[R](f: (Iterator[Row]) ⇒ Iterator[R])(implicit arg0: ClassTag[R]): RDD[R]

def na: DataFrameNaFunctions

final def ne(arg0: AnyRef): Boolean

final def notify(): Unit

final def notifyAll(): Unit