pyspark.pandas.DataFrame.pipe
DataFrame.pipe(func: Callable[[…], Any], *args: Any, **kwargs: Any) → Any
Apply func(self, *args, **kwargs).
- Parameters
- func: function
Function to apply to the DataFrame. args and kwargs are passed into func. Alternatively, a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the DataFrame.
- args: iterable, optional
Positional arguments passed into func.
- kwargs: mapping, optional
A dictionary of keyword arguments passed into func.
- Returns
- object: the return type of func.
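The two accepted forms of func can be illustrated with a minimal pure-Python sketch. This is not the library's source: pipe_sketch, scale, and scale_data_second are hypothetical names, plain lists stand in for a DataFrame, and the guard against a conflicting keyword is an assumption about reasonable behavior rather than something stated above.

```python
from typing import Any, Callable, Tuple, Union

# Either a plain callable, or a (callable, data_keyword) tuple.
PipeTarget = Union[Callable[..., Any], Tuple[Callable[..., Any], str]]


def pipe_sketch(data: Any, func: PipeTarget, *args: Any, **kwargs: Any) -> Any:
    """Hypothetical stand-in for DataFrame.pipe; not the library's code."""
    if isinstance(func, tuple):
        # (callable, data_keyword) form: inject the data under that keyword.
        target, data_keyword = func
        if data_keyword in kwargs:
            # Assumed guard: do not silently overwrite a caller-supplied keyword.
            raise ValueError(
                f"{data_keyword} is both the pipe target and a keyword argument"
            )
        kwargs[data_keyword] = data
        return target(*args, **kwargs)
    # Plain callable form: the data is passed as the first positional argument.
    return func(data, *args, **kwargs)


def scale(values, factor):
    # Takes the data first, so it can be piped directly.
    return [v * factor for v in values]


def scale_data_second(factor, values):
    # Takes the data second, so it needs the tuple form.
    return [v * factor for v in values]


first = pipe_sketch([1, 2, 3], scale, 10)
second = pipe_sketch([1, 2, 3], (scale_data_second, "values"), 10)
```

Both calls produce the same result. The tuple form only changes where the data is placed in the call to the target function.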
Notes
Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. For example, given

>>> df = ps.DataFrame({'category': ['A', 'A', 'B'],
...                    'col1': [1, 2, 3],
...                    'col2': [4, 5, 6]},
...                   columns=['category', 'col1', 'col2'])
>>> def keep_category_a(df):
...     return df[df['category'] == 'A']
>>> def add_one(df, column):
...     return df.assign(col3=df[column] + 1)
>>> def multiply(df, column1, column2):
...     return df.assign(col4=df[column1] * df[column2])
instead of writing

>>> multiply(add_one(keep_category_a(df), column="col1"), column1="col2", column2="col3")
  category  col1  col2  col3  col4
0        A     1     4     2     8
1        A     2     5     3    15
You can write

>>> (df.pipe(keep_category_a)
...    .pipe(add_one, column="col1")
...    .pipe(multiply, column1="col2", column2="col3")
... )
  category  col1  col2  col3  col4
0        A     1     4     2     8
1        A     2     5     3    15
If you have a function that takes the data as the second argument, pass a tuple indicating which keyword expects the data. For example, suppose multiply_2 takes its data as df:

>>> def multiply_2(column1, df, column2):
...     return df.assign(col4=df[column1] * df[column2])
Then you can write

>>> (df.pipe(keep_category_a)
...    .pipe(add_one, column="col1")
...    .pipe((multiply_2, 'df'), column1="col2", column2="col3")
... )
  category  col1  col2  col3  col4
0        A     1     4     2     8
1        A     2     5     3    15
You can use a lambda as well:

>>> ps.Series([1, 2, 3]).pipe(lambda x: (x + 1).rename("value"))
0    2
1    3
2    4
Name: value, dtype: int64