A constructor that automatically analyzes the logical plan.
This reports errors eagerly as the DataFrame is constructed, unless SQLConf.dataFrameEagerAnalysis is turned off.
Aggregates on the entire DataFrame without groups.
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg(max($"age"), avg($"salary"))
df.groupBy().agg(max($"age"), avg($"salary"))
1.3.0
(Java-specific) Aggregates on the entire DataFrame without groups.
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg(Map("age" -> "max", "salary" -> "avg"))
df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
1.3.0
(Scala-specific) Aggregates on the entire DataFrame without groups.
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg(Map("age" -> "max", "salary" -> "avg"))
df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
1.3.0
(Scala-specific) Aggregates on the entire DataFrame without groups.
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg("age" -> "max", "salary" -> "avg")
df.groupBy().agg("age" -> "max", "salary" -> "avg")
1.3.0
(Scala-specific) Returns a new DataFrame with an alias set. Same as the as method.
1.6.0
Returns a new DataFrame with an alias set. Same as the as method.
1.6.0
Selects a column based on the column name and returns it as a Column.
Note that the column name can also reference a nested column like a.b.
1.3.0
(Scala-specific) Returns a new DataFrame with an alias set.
1.3.0
Returns a new DataFrame with an alias set.
1.3.0
:: Experimental ::
Converts this DataFrame to a strongly-typed Dataset containing objects of the specified type, U.
1.6.0
Persist this DataFrame with the default storage level (MEMORY_AND_DISK).
1.3.0
Returns a new DataFrame that has exactly numPartitions partitions.
Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead each of the 100 new partitions will claim 10 of the current partitions.
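The 1000-to-100 case can be pictured with a small plain-Scala model. This is only a sketch of the idea that each new partition claims a contiguous group of existing ones; the object and method names are hypothetical, and it is not Spark's actual assignment logic.

```scala
// Plain-Scala model of a narrow coalesce; illustrative only, not Spark code.
object CoalesceSketch {
  /** Groups `current` partition indices into `target` contiguous buckets. */
  def assign(current: Int, target: Int): Vector[Vector[Int]] = {
    // This model only covers reducing the partition count.
    require(target > 0 && target <= current, "target must be in 1..current")
    (0 until target).map { i =>
      val start = i * current / target      // first old partition claimed by bucket i
      val end = (i + 1) * current / target  // one past the last old partition claimed
      (start until end).toVector
    }.toVector
  }
}
```

With `CoalesceSketch.assign(1000, 100)`, every bucket holds 10 of the original partition indices and no index is claimed twice, which is why no shuffle is needed.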
1.4.0
Selects a column based on the column name and returns it as a Column.
Note that the column name can also reference a nested column like a.b.
1.3.0
Returns an array that contains all of the Rows in this DataFrame.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
For the Java API, use collectAsList.
1.3.0
Returns a Java list that contains all of the Rows in this DataFrame.
Returns all column names as an array.
1.3.0
Returns the number of rows in the DataFrame.
1.3.0
Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group.
df.cube("department", "group").avg()

// Compute the max age and average salary, cubed by department and gender.
df.cube($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
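To make "multi-dimensional cube" concrete: a cube over a list of columns aggregates over every subset of those columns, including the empty set (the grand total). The plain-Scala sketch below only enumerates those grouping sets; `CubeSketch` and `groupingSets` are hypothetical names and no Spark API is involved.

```scala
// Plain-Scala enumeration of the grouping sets behind a cube; illustrative only.
object CubeSketch {
  /** Every subset of `cols`, keeping the original column order inside each subset. */
  def groupingSets(cols: List[String]): List[List[String]] =
    cols.foldRight(List(List.empty[String])) { (col, acc) =>
      // Each existing subset appears once with `col` and once without it.
      acc.map(col :: _) ++ acc
    }
}
```

For the columns department and group this yields four grouping sets: both columns, department alone, group alone, and the empty set.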
1.4.0
Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
// Compute the average for all numeric columns cubed by department and group.
df.cube($"department", $"group").avg()

// Compute the max age and average salary, cubed by department and gender.
df.cube($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
1.4.0
Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. If you want to programmatically compute summary statistics, use the agg function instead.
df.describe("age", "height").show()

// output:
// summary age   height
// count   10.0  10.0
// mean    53.3  178.05
// stddev  11.6  15.7
// min     18.0  163.0
// max     92.0  192.0
1.3.1
Returns a new DataFrame that contains only the unique rows from this DataFrame.
Returns a new DataFrame with a column dropped. This version of drop accepts a Column rather than a name. This is a no-op if the DataFrame doesn't have a column with an equivalent expression.
1.4.1
Returns a new DataFrame with a column dropped. This is a no-op if the schema doesn't contain the column name.
1.4.0
Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.
1.4.0
(Scala-specific) Returns a new DataFrame with duplicate rows removed, considering only the subset of columns.
1.4.0
Returns a new DataFrame that contains only the unique rows from this DataFrame.
Returns all column names and their data types as an array.
1.3.0
Returns a new DataFrame containing rows in this frame but not in another frame.
This is equivalent to EXCEPT in SQL.
1.3.0
Prints the physical plan to the console for debugging purposes.
1.3.0
Prints the plans (logical and physical) to the console for debugging purposes.
1.3.0
(Scala-specific) Returns a new DataFrame where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.
df.explode("words", "word"){words: String => words.split(" ")}
1.3.0
(Scala-specific) Returns a new DataFrame where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.
The following example uses this function to count the number of books which contain a given word:
case class Book(title: String, words: String)
val df: RDD[Book]

case class Word(word: String)
val allWords = df.explode('words) {
  case Row(words: String) => words.split(" ").map(Word(_))
}

val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))
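The "implicitly joined" behaviour can be modelled with ordinary Scala collections. The sketch below is not Spark code: `ExplodeSketch`, `BookWord`, `explodeWords`, and `bookCountPerWord` are hypothetical names, and in-memory sequences stand in for the DataFrame. Each input row is paired with every value the function emits, so a row that emits nothing disappears from the result.

```scala
// Plain-Scala model of explode followed by a distinct count; illustrative only.
object ExplodeSketch {
  final case class Book(title: String, words: String)
  // An exploded row: all input columns plus the newly generated `word` column.
  final case class BookWord(title: String, words: String, word: String)

  /** Expands each book into zero or more rows, one per word. */
  def explodeWords(books: Seq[Book]): Seq[BookWord] =
    books.flatMap { b =>
      b.words.split(" ").filter(_.nonEmpty).map(w => BookWord(b.title, b.words, w))
    }

  /** Counts the distinct titles that contain each word. */
  def bookCountPerWord(books: Seq[Book]): Map[String, Int] =
    explodeWords(books)
      .groupBy(_.word)
      .map { case (word, rows) => word -> rows.map(_.title).distinct.size }
}
```

A book with an empty `words` field contributes zero rows here, which mirrors the "zero or more rows" wording above.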
1.3.0
Filters rows using the given SQL expression.
peopleDf.filter("age > 15")
1.3.0
Filters rows using the given condition.
// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
1.3.0
Returns the first row. Alias for head().
1.3.0
Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results.
1.3.0
Applies a function f to all rows.
1.3.0
Applies a function f to each partition of this DataFrame.
1.3.0
Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns grouped by department.
df.groupBy("department").avg()

// Compute the max age and average salary, grouped by department and gender.
df.groupBy($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
1.3.0
Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
// Compute the average for all numeric columns grouped by department.
df.groupBy($"department").avg()

// Compute the max age and average salary, grouped by department and gender.
df.groupBy($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
1.3.0
Returns the first row.
1.3.0
Returns the first n rows.
1.3.0
Returns a best-effort snapshot of the files that compose this DataFrame. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.
Returns a new DataFrame containing rows only in both this frame and another frame.
This is equivalent to INTERSECT in SQL.
1.3.0
Returns true if the collect and take methods can be run locally (without any Spark executors).
1.3.0
Converts a JavaRDD to a PythonRDD.
Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.
// Scala:
import org.apache.spark.sql.functions._
df1.join(df2, $"df1Key" === $"df2Key", "outer")

// Java:
import static org.apache.spark.sql.functions.*;
df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
Right side of the join.
Join expression.
One of: inner, outer, left_outer, right_outer, leftsemi.
1.3.0
Inner join with another DataFrame, using the given join expression.
// The following two are equivalent:
df1.join(df2, $"df1Key" === $"df2Key")
df1.join(df2).where($"df1Key" === $"df2Key")
1.3.0
Equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
Right side of the join operation.
Names of the columns to join on. These columns must exist on both sides.
One of: inner, outer, left_outer, right_outer, leftsemi.
1.6.0
Inner equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
// Joining df1 and df2 using the columns "user_id" and "user_name"
df1.join(df2, Seq("user_id", "user_name"))
Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
Right side of the join operation.
Names of the columns to join on. These columns must exist on both sides.
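The "join columns appear once" behaviour of this equi-join can be sketched with plain Scala maps standing in for rows. This is a toy model under stated assumptions, not Spark's implementation: `UsingJoinSketch`, `Row`, and `innerJoinUsing` are hypothetical names, and the model assumes the two sides share no column names other than the join columns.

```scala
// Plain-Scala model of an inner JOIN USING; illustrative only.
object UsingJoinSketch {
  type Row = Map[String, Any] // column name -> value

  /** Inner equi-join on `cols`; each join column is emitted once, taken from the left row. */
  def innerJoinUsing(left: Seq[Row], right: Seq[Row], cols: Seq[String]): Seq[Row] =
    for {
      l <- left
      r <- right
      // Rows match when every join column is present on the left and equal on both sides.
      if cols.forall(c => l.contains(c) && l.get(c) == r.get(c))
    } yield l ++ r.filter { case (name, _) => !cols.contains(name) }
}
```

Joining a left row `Map("user_id" -> 1, "a" -> "x")` with a right row `Map("user_id" -> 1, "b" -> "y")` on `user_id` produces a single row with the three columns user_id, a, and b.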
1.4.0
Inner equi-join with another DataFrame using the given column.
Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
// Joining df1 and df2 using the column "user_id"
df1.join(df2, "user_id")
Note that if you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
Right side of the join operation.
Name of the column to join on. This column must exist on both sides.
1.4.0
Cartesian join with another DataFrame.
Note that cartesian joins are very expensive without an extra filter that can be pushed down.
Right side of the join operation.
1.3.0
Returns a new DataFrame by taking the first n rows.
Returns a new RDD by applying a function to all rows of this DataFrame.
1.3.0
Returns a new RDD by applying a function to each partition of this DataFrame.
1.3.0
Returns a DataFrameNaFunctions for working with missing data.
// Dropping rows containing any null values.
df.na.drop()
1.3.1
Returns a new DataFrame sorted by the given expressions.
This is an alias of the sort function.
1.3.0
Returns a new DataFrame sorted by the given expressions.
This is an alias of the sort function.
1.3.0
Persist this DataFrame with the given storage level.
One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
1.3.0
Persist this DataFrame with the default storage level (MEMORY_AND_DISK).
1.3.0
Prints the schema to the console in a nice tree format.
1.3.0
Randomly splits this DataFrame with the provided weights.
Weights for splits; they will be normalized if they don't sum to 1.
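The normalization of weights can be illustrated in plain Scala. The sketch below is an assumption about how normalized weights could be turned into sampling ranges, not a copy of Spark's code; `WeightSketch`, `normalize`, and `bounds` are hypothetical names.

```scala
// Plain-Scala model of weight normalization for a random split; illustrative only.
object WeightSketch {
  /** Scales the weights so that they sum to 1. */
  def normalize(weights: Array[Double]): Array[Double] = {
    require(weights.nonEmpty && weights.forall(_ >= 0) && weights.sum > 0,
      "weights must be non-negative with a positive sum")
    val total = weights.sum
    weights.map(_ / total)
  }

  /** Cumulative [lower, upper) ranges over [0, 1], one per split. */
  def bounds(weights: Array[Double]): Seq[(Double, Double)] = {
    val cumulative = normalize(weights).scanLeft(0.0)(_ + _)
    cumulative.zip(cumulative.tail).toSeq
  }
}
```

Weights of 3.0 and 1.0 normalize to 0.75 and 0.25, so a row whose random draw falls in [0, 0.75) would go to the first split and the rest to the second.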
1.4.0
Randomly splits this DataFrame with the provided weights.
Weights for splits; they will be normalized if they don't sum to 1.
Seed for sampling.
1.4.0
Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SQLContext that was used to create this DataFrame.
1.3.0
Returns a new DataFrame partitioned by the given partitioning expressions preserving the existing number of partitions. The resulting DataFrame is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
1.6.0
Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions. The resulting DataFrame is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
1.6.0
Returns a new DataFrame that has exactly numPartitions partitions.
1.3.0
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolled up by department and group.
df.rollup("department", "group").avg()

// Compute the max age and average salary, rolled up by department and gender.
df.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
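A rollup aggregates over a hierarchy: the full column list, then each shorter prefix of it, down to the empty set (the grand total). The plain-Scala sketch below only enumerates those grouping sets; `RollupSketch` and `groupingSets` are hypothetical names and no Spark API is involved.

```scala
// Plain-Scala enumeration of the grouping sets behind a rollup; illustrative only.
object RollupSketch {
  /** Every prefix of `cols`, from the full list down to the empty list. */
  def groupingSets(cols: List[String]): List[List[String]] =
    cols.inits.toList
}
```

For the columns department and group this gives three grouping sets: both columns, department alone, and the empty set. Unlike a cube, there is no set containing group alone.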
1.4.0
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
// Compute the average for all numeric columns rolled up by department and group.
df.rollup($"department", $"group").avg()

// Compute the max age and average salary, rolled up by department and gender.
df.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))
1.4.0
Returns a new DataFrame by sampling a fraction of rows, using a random seed.
Sample with replacement or not.
Fraction of rows to generate.
1.3.0
Returns a new DataFrame by sampling a fraction of rows.
Sample with replacement or not.
Fraction of rows to generate.
Seed for sampling.
1.3.0
Returns the schema of this DataFrame.
Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).
// The following two are equivalent:
df.select("colA", "colB")
df.select($"colA", $"colB")
1.3.0
Selects a set of column-based expressions.
df.select($"colA", $"colB" + 1)
1.3.0
Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.
// The following are equivalent:
df.selectExpr("colA", "colB as newName", "abs(colC)")
df.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
1.3.0
Displays the DataFrame in a tabular form. For example:
year  month AVG('Adj Close) MAX('Adj Close)
1980  12    0.503218        0.595103
1981  01    0.523289        0.570307
1982  02    0.436504        0.475256
1983  03    0.410516        0.442194
1984  04    0.450090        0.483521
Number of rows to show.
Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be right-aligned.
1.5.0
Displays the top 20 rows of the DataFrame in a tabular form.
Whether to truncate long strings. If true, strings longer than 20 characters will be truncated and all cells will be right-aligned.
1.5.0
Displays the top 20 rows of the DataFrame in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be right-aligned.
1.3.0
Displays the DataFrame in a tabular form. Strings longer than 20 characters will be truncated, and all cells will be right-aligned. For example:
year  month AVG('Adj Close) MAX('Adj Close)
1980  12    0.503218        0.595103
1981  01    0.523289        0.570307
1982  02    0.436504        0.475256
1983  03    0.410516        0.442194
1984  04    0.450090        0.483521
Number of rows to show.
1.3.0
Returns a new DataFrame sorted by the given expressions. For example:
df.sort($"col1", $"col2".desc)
1.3.0
Returns a new DataFrame sorted by the specified column, all in ascending order.
// The following 3 are equivalent
df.sort("sortcol")
df.sort($"sortcol")
df.sort($"sortcol".asc)
1.3.0
Returns a new DataFrame with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
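The difference between sorting within partitions and a global sort can be shown with a small plain-Scala model, where a vector of vectors stands in for the partitions of a DataFrame. The names `SortWithinPartitionsSketch` and `sortWithinPartitions` below are hypothetical and no Spark API is used.

```scala
// Plain-Scala model of a per-partition sort; illustrative only.
object SortWithinPartitionsSketch {
  /** Sorts each partition independently; rows never move between partitions. */
  def sortWithinPartitions(partitions: Vector[Vector[Int]]): Vector[Vector[Int]] =
    partitions.map(_.sorted)
}
```

Given the partitions `Vector(3, 1)` and `Vector(2, 0)`, the result is `Vector(1, 3)` and `Vector(0, 2)`: each partition is ordered, but the concatenation 1, 3, 0, 2 is not globally sorted.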
1.6.0
Returns a new DataFrame with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
1.6.0
Returns a DataFrameStatFunctions object that provides statistical functions.
// Finding frequent items in column with name 'a'.
df.stat.freqItems(Seq("a"))
1.4.0
Returns the first n rows in the DataFrame.
Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.
1.3.0
Returns the first n rows in the DataFrame as a list.
Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.
1.6.0
Returns a new DataFrame with columns renamed. This can be quite convenient when converting an RDD of tuples into a DataFrame with meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF()  // this implicit conversion creates a DataFrame with column name _1 and _2
rdd.toDF("id", "name")  // this creates a DataFrame with column name "id" and "name"
1.3.0
Returns the object itself.
1.3.0
Returns the content of the DataFrame as an RDD of JSON strings.
1.3.0
Concise syntax for chaining custom transformations.
def featurize(ds: DataFrame) = ...
df
.transform(featurize)
.transform(...)
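The chaining pattern can be tried without Spark by giving a toy type the same kind of `transform` method. In the sketch below, `Table` is a hypothetical stand-in for DataFrame, and `dropNegatives` and `double` are made-up transformations; only the shape of the `transform` method is meant to mirror the text above.

```scala
// Plain-Scala model of chaining custom transformations; illustrative only.
object TransformSketch {
  // Hypothetical stand-in for a DataFrame holding a single integer column.
  final case class Table(rows: Vector[Int]) {
    /** Applies `t` to this table, so custom steps can be chained left to right. */
    def transform[U](t: Table => U): U = t(this)
  }

  def dropNegatives(t: Table): Table = Table(t.rows.filter(_ >= 0))
  def double(t: Table): Table = Table(t.rows.map(_ * 2))
}
```

Writing `Table(Vector(-1, 2, 3)).transform(dropNegatives).transform(double)` reads in pipeline order, whereas the equivalent nested call `double(dropNegatives(table))` reads inside out.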
1.6.0
Returns a new DataFrame containing the union of rows in this frame and another frame.
This is equivalent to UNION ALL in SQL.
1.3.0
Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk.
1.3.0
Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk.
Whether to block until all blocks are deleted.
1.3.0
Filters rows using the given SQL expression.
peopleDf.where("age > 15")
1.5.0
Filters rows using the given condition. This is an alias for filter.
// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
1.3.0
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
1.3.0
Returns a new DataFrame with a column renamed. This is a no-op if the schema doesn't contain existingName.
1.3.0
:: Experimental :: Interface for saving the content of the DataFrame out into external storage.
1.4.0
Save this DataFrame to a JDBC database at url under the table name table.
This will run a CREATE TABLE and a bunch of INSERT INTO statements. If you pass true for allowExisting, it will drop any table with the given name; if you pass false, it will throw if the table already exists.
(Since version 1.4.0) Use write.jdbc(). This will be removed in Spark 2.0.
Adds the rows from this RDD to the specified table. Throws an exception if the table already exists.
(Since version 1.4.0)
Adds the rows from this RDD to the specified table, optionally overwriting the existing data.
(Since version 1.4.0)
Save this DataFrame to a JDBC database at url under the table name table.
Assumes the table already exists and has a compatible schema. If you pass true for overwrite, it will TRUNCATE the table before performing the INSERTs.
The table must already exist on the database. It must have a schema that is compatible with the schema of this RDD; inserting the rows of the RDD in order via the simple statement INSERT INTO table VALUES (?, ?, ..., ?) should not fail.
(Since version 1.4.0) Use write.jdbc(). This will be removed in Spark 2.0.
(Scala-specific) Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.
(Since version 1.4.0)
Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options.
(Since version 1.4.0)
Saves the contents of this DataFrame to the given path based on the given data source and SaveMode specified by mode.
(Since version 1.4.0)
Saves the contents of this DataFrame to the given path based on the given data source, using SaveMode.ErrorIfExists as the save mode.
(Since version 1.4.0) Use write.format(source).save(path). This will be removed in Spark 2.0.
Saves the contents of this DataFrame to the given path and SaveMode specified by mode, using the default data source configured by spark.sql.sources.default.
(Since version 1.4.0) Use write.mode(mode).save(path). This will be removed in Spark 2.0.
Saves the contents of this DataFrame to the given path, using the default data source configured by spark.sql.sources.default and SaveMode.ErrorIfExists as the save mode.
(Since version 1.4.0) Use write.save(path). This will be removed in Spark 2.0.
Saves the contents of this DataFrame as a parquet file, preserving the schema. Files that are written out using this method can be read back in as a DataFrame using the parquetFile function in SQLContext.
(Since version 1.4.0) Use write.parquet(path). This will be removed in Spark 2.0.
(Scala-specific) Creates a table from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.
Note that this currently only works with DataFrames that are created from a HiveContext, as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
(Since version 1.4.0)
Creates a table at the given path from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.
Note that this currently only works with DataFrames that are created from a HiveContext, as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
(Since version 1.4.0)
:: Experimental :: Creates a table at the given path from the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options.
Note that this currently only works with DataFrames that are created from a HiveContext, as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
(Since version 1.4.0)
Creates a table at the given path from the contents of this DataFrame based on a given data source and a set of options, using SaveMode.ErrorIfExists as the save mode.
Note that this currently only works with DataFrames that are created from a HiveContext, as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
(Since version 1.4.0) Use write.format(source).saveAsTable(tableName). This will be removed in Spark 2.0.
Creates a table from the contents of this DataFrame, using the default data source configured by spark.sql.sources.default and SaveMode.ErrorIfExists as the save mode.
Note that this currently only works with DataFrames that are created from a HiveContext, as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
(Since version 1.4.0) Use write.mode(mode).saveAsTable(tableName). This will be removed in Spark 2.0.
Creates a table from the contents of this DataFrame. It will use the default data source configured by spark.sql.sources.default. This will fail if the table already exists.
Note that this currently only works with DataFrames that are created from a HiveContext, as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an insertInto.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
(Since version 1.4.0) Use write.saveAsTable(tableName). This will be removed in Spark 2.0.
(Since version 1.3.0) Use toDF. This will be removed in Spark 2.0.
:: Experimental :: A distributed collection of data organized into named columns.
A DataFrame is equivalent to a relational table in Spark SQL. The following example creates a DataFrame by pointing Spark SQL to a Parquet data set.
Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: DataFrame (this class), Column, and functions.
To select a column from the data frame, use the apply method in Scala and col in Java.
Note that the Column type can also be manipulated through its various functions.
A more concrete example in Scala:
and in Java:
1.3.0