User Defined Functions (UDFs)

daft.udf(f: Optional[Callable] = None, *, return_type: type, type_hints: Optional[dict[str, type]] = None, num_gpus: Optional[Union[int, float]] = None, num_cpus: Optional[Union[int, float]] = None, memory_bytes: Optional[Union[int, float]] = None) → Callable

Decorator for creating a UDF. This decorator wraps any custom Python code into a function that can be used to process columns of data in a Daft DataFrame.

Note

UDFs are much slower than native Daft expressions because they run Python code instead of Daft’s optimized Rust kernels. You should only use UDFs when performing operations that are not supported by Daft’s native expressions, or when you need to run custom Python code. For example, the following UDF will be much slower than df["x"] + 100.

Example:

>>> from daft import udf
>>>
>>> @udf(return_type=int)
>>> def add_val(x, val=1):
>>>     # x arrives as a Python list; return one output value per input value
>>>     return [value + val for value in x]

To invoke your UDF, you can use the DataFrame.with_column method:

>>> df = DataFrame.from_pydict({"x": [1, 2, 3]})
>>> df = df.with_column("x_add_100", add_val(df["x"], val=100))
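
Before handing a function to Daft, it can help to check its logic on a plain list. The sketch below runs the undecorated body with no Daft involved. It assumes the column arrives as a Python list (the default described under Input/Return Types) and that the function returns one output value per input value. The name add_val_body is hypothetical and used only for illustration.

```python
def add_val_body(x, val=1):
    # x is the whole column as a Python list, not a single value
    return [value + val for value in x]


column = [1, 2, 3]
result = add_val_body(column, val=100)

# The output has one entry per input row
assert len(result) == len(column)
print(result)
```

Because the function sees the whole column at once, a bug such as adding to x instead of to each value shows up immediately in a check like this.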

Input/Return Types

By default, Daft will pass columns of data into your function as Python lists. However, if this is a bottleneck for your application, you may choose more optimized types for your inputs by annotating your function inputs with type hints.

In the following example, we annotate the x input parameter as an np.ndarray. Daft will now pass your data in as a NumPy array, which is much more efficient to work with than a Python list for numerical operations.

>>> import numpy as np
>>>
>>> @udf(return_type=int)
>>> def add_val(x: np.ndarray, val: int = 1):
>>>     return x + val

Note also that Daft supports return types other than lists. In the above example, the returned value is a Numpy array as well.

Input and Return types supported by Daft UDFs and their respective type annotations:

  1. Numpy Arrays (np.ndarray)

  2. Pandas Series (pd.Series)

  3. Polars Series (polars.Series)

  4. PyArrow Arrays (pa.Array)

  5. Python lists (list or typing.List)

Note

Type annotations can be finicky in Python, depending on your Python version and on whether you use from __future__ import annotations to enable typing functionality from future Python versions. Daft will alert you if it cannot infer types from your annotations. In that case, you can provide the types explicitly as a dictionary mapping each input parameter name to its type, using the @udf(type_hints=...) keyword argument.
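
One reason annotations are hard to read back is that from __future__ import annotations stores every annotation as a string rather than as a type. The sketch below reproduces that effect with explicit string annotations, using only the standard library. It does not show how Daft inspects annotations internally, which the text above does not specify. The explicit_type_hints dictionary is a hypothetical example of the name-to-type mapping described for type_hints.

```python
import inspect
import typing


# String annotations mimic what `from __future__ import annotations`
# does to every annotation in a module.
def add_val(x: "list", val: "int" = 1):
    return [value + val for value in x]


# Raw annotations come back as strings, not as types
raw = {
    name: param.annotation
    for name, param in inspect.signature(add_val).parameters.items()
}

# typing.get_type_hints evaluates the strings back into real types
resolved = typing.get_type_hints(add_val)

# An explicit mapping of parameter name to type avoids the question entirely
explicit_type_hints = {"x": list, "val": int}

print(raw)
print(resolved)
```

When string annotations cannot be resolved, an explicit mapping like explicit_type_hints carries the same information without relying on annotation evaluation.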

Stateful UDFs

UDFs can also be created from classes. This lets you initialize expensive state once, for example by downloading data or creating a model, and share it between invocations of the class.

>>> @udf(return_type=int)
>>> class RunModel:
>>>     def __init__(self):
>>>         # Perform expensive initializations
>>>         self._model = create_model()
>>>
>>>     def __call__(self, features_col):
>>>         return self._model(features_col)
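
The value of the class form comes from separating one-time setup in __init__ from per-call work in __call__. The sketch below shows that separation in plain Python, with no Daft involved. The init_count counter and the _offset "model" are hypothetical stand-ins, and how often Daft itself instantiates the class is not shown here.

```python
class RunModel:
    init_count = 0  # counts how many times the expensive setup runs

    def __init__(self):
        # Stand-in for an expensive step such as loading a model
        RunModel.init_count += 1
        self._offset = 10  # hypothetical "model" state

    def __call__(self, features_col):
        # Reuses the state built in __init__ on every call
        return [value + self._offset for value in features_col]


runner = RunModel()  # setup happens once, here
first = runner([1, 2, 3])
second = runner([4, 5])

print(RunModel.init_count, first, second)
```

Both calls reuse the same instance, so the setup cost is paid once however many batches are processed.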

Parameters
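  • f – Function to wrap as a UDF. It accepts column inputs as Python lists by default, or as the type named in its annotations (for example NumPy arrays), and returns a column of data as a Python list, NumPy array, Pandas Series, Polars Series, or PyArrow Array.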

  • return_type – The return type of the UDF

  • type_hints – Optional dictionary of input parameter names to their types. If provided, this will override type hints provided using the function’s type annotations.

  • num_gpus – Deprecated - please use DataFrame.with_column(…, resource_request=…) instead

  • num_cpus – Deprecated - please use DataFrame.with_column(…, resource_request=…) instead

  • memory_bytes – Deprecated - please use DataFrame.with_column(…, resource_request=…) instead