User Defined Functions (UDFs)
- daft.udf(f: Optional[Callable] = None, *, return_type: type, type_hints: Optional[dict[str, type]] = None, num_gpus: Optional[Union[int, float]] = None, num_cpus: Optional[Union[int, float]] = None, memory_bytes: Optional[Union[int, float]] = None) -> Callable
Decorator for creating a UDF. This decorator wraps any custom Python code into a function that can be used to process columns of data in a Daft DataFrame.
Note
UDFs are much slower than native Daft expressions because they run Python code instead of Daft’s optimized Rust kernels. You should only use UDFs when performing operations that are not supported by Daft’s native expressions, or when you need to run custom Python code. For example, the following UDF will be much slower than df["x"] + 100.
Example:
>>> from daft import DataFrame, udf
>>>
>>> @udf(return_type=int)
>>> def add_val(x, val=1):
>>>     # Your custom Python code here
>>>     return [value + val for value in x]
To invoke your UDF, you can use the DataFrame.with_column method:
>>> df = DataFrame.from_pydict({"x": [1, 2, 3]})
>>> df = df.with_column("x_add_100", add_val(df["x"], val=100))
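The function body receives the whole column at once rather than one value at a time. As a minimal sketch of that contract, the snippet below calls an undecorated copy of add_val directly on a plain Python list, which is how Daft passes columns by default. It does not import Daft and says nothing about how Daft schedules the call. The variable names column and result are illustrative only.

```python
# Plain-Python stand-in for the UDF body; no Daft import is needed to run it.
def add_val(x, val=1):
    # x is the whole column (a Python list by default), so iterate over it
    # and return one output value per input value.
    return [value + val for value in x]


column = [1, 2, 3]  # illustrative column contents
result = add_val(column, val=100)
print(result)  # [101, 102, 103]
```

The returned list has the same length as the input column, which is what lets the result be attached as a new column.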
Input/Return Types
By default, Daft will pass columns of data into your function as Python lists. However, if this is a bottleneck for your application, you may choose more optimized types for your inputs by annotating your function inputs with type hints.
In the following example, we annotate the x input parameter as an np.ndarray. Daft will now pass your data in as a Numpy array which is much more efficient to work with than a Python list for numerical operations.
>>> import numpy as np
>>>
>>> @udf(return_type=int)
>>> def add_val(x: np.ndarray, val: int = 1):
>>>     return x + val
Note also that Daft supports return types other than lists. In the above example, the returned value is a Numpy array as well.
Input and return types supported by Daft UDFs and their respective type annotations:
- Numpy Arrays (np.ndarray)
- Pandas Series (pd.Series)
- Polars Series (polars.Series)
- PyArrow Arrays (pa.Array)
- Python lists (list or typing.List)
Note
Type annotations can be finicky in Python, depending on the version of Python you are using and on whether you are using typing functionality from future Python versions with from __future__ import annotations. Daft will alert you if it cannot infer types from your annotations, and you may choose to provide your types explicitly as a dictionary of input parameter name to its type in the @udf(type_hints=...) keyword argument.
Stateful UDFs
UDFs can also be created from classes. A class lets you initialize expensive state once, for example by downloading data or creating a model, and share that state between invocations.
>>> @udf(return_type=int)
>>> class RunModel:
>>>     def __init__(self):
>>>         # Perform expensive initializations
>>>         self._model = create_model()
>>>
>>>     def __call__(self, features_col):
>>>         return self._model(features_col)
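The pattern behind a stateful UDF can be sketched without Daft: the expensive setup lives in __init__, and __call__ reuses it for every batch. In the snippet below the doubling lambda is a hypothetical stand-in for create_model(), and init_calls is an illustrative counter added only to make the sharing visible. When and how often Daft itself instantiates the class is not stated here, so treat this purely as an illustration of the class shape.

```python
class RunModel:
    init_calls = 0  # illustrative counter: how often the expensive setup ran

    def __init__(self):
        RunModel.init_calls += 1
        # Hypothetical stand-in for create_model(): doubles each feature.
        self._model = lambda features: [2 * f for f in features]

    def __call__(self, features_col):
        # Reuses the state built in __init__ for every column it is given.
        return self._model(features_col)


runner = RunModel()
batch_a = runner([1, 2])
batch_b = runner([3])
print(RunModel.init_calls, batch_a, batch_b)  # 1 [2, 4] [6]
```

One instance serves both calls, so the setup cost is paid once rather than per invocation.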
- Parameters
f – Function to wrap as a UDF. It accepts columns of data as inputs (Python lists by default, or the types given by type hints) and returns a column of data as a Polars Series, Numpy array, Python list, or Pandas Series.
return_type – The return type of the UDF
type_hints – Optional dictionary of input parameter names to their types. If provided, this will override type hints provided using the function’s type annotations.
num_gpus – Deprecated - please use DataFrame.with_column(…, resource_request=…) instead
num_cpus – Deprecated - please use DataFrame.with_column(…, resource_request=…) instead
memory_bytes – Deprecated - please use DataFrame.with_column(…, resource_request=…) instead
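One reason the type_hints parameter is useful: annotations are sometimes stored as strings (as happens under from __future__ import annotations), and resolving them can fail when the referenced name is not importable in the function's module. The standard-library sketch below reproduces that situation with an explicitly quoted annotation. The explicit_hints dictionary is only an illustrative value in the shape the type_hints parameter describes, and the snippet does not show how Daft resolves annotations internally.

```python
import typing


def add_val(x: "np.ndarray", val: int = 1):
    # The quoted annotation is stored as a plain string, the same form that
    # `from __future__ import annotations` produces for every annotation.
    return x


raw = add_val.__annotations__
print(type(raw["x"]).__name__)  # str

try:
    # Resolution evaluates the string; `np` was never imported in this module.
    typing.get_type_hints(add_val)
    resolved = True
except NameError:
    resolved = False

# Illustrative explicit mapping of parameter name to type, in the shape
# described for type_hints (an assumption for demonstration, not Daft output).
explicit_hints = {"x": list, "val": int}
```

Passing an explicit mapping sidesteps annotation resolution, which is why it overrides the function's own annotations when both are present.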