DataFrame#

DataFrame

A Daft DataFrame is a table of data.

Note

Most DataFrame methods are lazy: they do not execute any computation when invoked. Instead, each operation is enqueued in the DataFrame’s internal query plan and is executed only when one of the methods listed under Execution is called.
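To make the lazy behaviour concrete, here is a minimal toy model in plain Python. It is not Daft's implementation and does not use the Daft API; the class `ToyLazyFrame` and the helper `is_even` are hypothetical names invented for this sketch. It only illustrates the idea that methods such as `where` and `limit` record a step in a plan, while an execution method such as `collect` runs the recorded steps.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, int]
Step = Callable[[List[Row]], List[Row]]


class ToyLazyFrame:
    """Toy stand-in for a lazily evaluated table (NOT Daft's implementation)."""

    def __init__(self, rows: List[Row], plan: Optional[List[Step]] = None):
        self._rows = rows
        # The "query plan" here is just an ordered list of pending steps.
        self._plan: List[Step] = list(plan or [])

    def where(self, predicate: Callable[[Row], bool]) -> "ToyLazyFrame":
        # Record the filter; nothing is computed yet.
        def step(rows: List[Row]) -> List[Row]:
            return [row for row in rows if predicate(row)]
        return ToyLazyFrame(self._rows, self._plan + [step])

    def limit(self, n: int) -> "ToyLazyFrame":
        # Record the limit; nothing is computed yet.
        def step(rows: List[Row]) -> List[Row]:
            return rows[:n]
        return ToyLazyFrame(self._rows, self._plan + [step])

    def collect(self) -> List[Row]:
        # Execution method: run every enqueued step, in order.
        rows = self._rows
        for step in self._plan:
            rows = step(rows)
        return rows


calls: List[int] = []  # records every time the predicate actually runs


def is_even(row: Row) -> bool:
    calls.append(row["x"])
    return row["x"] % 2 == 0


frame = ToyLazyFrame([{"x": i} for i in range(6)])
lazy = frame.where(is_even).limit(2)
calls_before_collect = len(calls)  # still 0: the plan has only been built
result = lazy.collect()            # the plan runs here
```

In this sketch the predicate is never called until `collect` runs, which mirrors the split between the lazy methods and the Execution methods on this page. A real engine may also optimize the plan before running it; the toy model does not attempt that.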

Construction#

From Files#

daft.DataFrame.read_csv

Creates a DataFrame from CSV file(s)

daft.DataFrame.read_json

Creates a DataFrame from line-delimited JSON file(s)

daft.DataFrame.read_parquet

Creates a DataFrame from Parquet file(s)

daft.DataFrame.from_glob_path

Creates a DataFrame of file paths and other metadata from a glob path

From In-Memory Data#

daft.DataFrame.from_pylist

Creates a DataFrame from a list of dictionaries

daft.DataFrame.from_pydict

Creates a DataFrame from a Python dictionary
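The two in-memory constructors differ in the shape of their input. The sketch below shows the two layouts in plain Python without calling Daft. It assumes that the dictionary form maps each column name to a list of values, which this page does not state explicitly, and the helper names `rows_to_columns` and `columns_to_rows` are hypothetical.

```python
from typing import Any, Dict, List

# Row-oriented layout: a list of dictionaries, one dictionary per row.
rows: List[Dict[str, Any]] = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
]

# Column-oriented layout (assumed): one list of values per column name.
columns: Dict[str, List[Any]] = {
    "id": [1, 2],
    "name": ["a", "b"],
}


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Convert a list of row dicts to a dict of column lists.

    Assumes every row has the same keys.
    """
    out: Dict[str, List[Any]] = {}
    for row in rows:
        for key, value in row.items():
            out.setdefault(key, []).append(value)
    return out


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Convert a dict of equal-length column lists to a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]
```

Both layouts describe the same table, so the choice of constructor usually comes down to which shape the data already has.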

Data Manipulation#

Manipulating Columns#

daft.DataFrame.select

Creates a new DataFrame from the provided expressions, similar to a SQL SELECT

daft.DataFrame.with_column

Adds a column to the current DataFrame with an Expression, equivalent to a select with all current columns and the new one

daft.DataFrame.exclude

Drops columns from the current DataFrame by name

daft.DataFrame.explode

Explodes a List column: every element in each row's List becomes its own row, and the values of all other columns are duplicated across the resulting rows
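The row duplication performed by an explode can be sketched in plain Python on a list of row dictionaries. This is an illustration of the semantics only, not Daft code; the function `explode_rows` is a hypothetical helper. In this sketch a row whose list is empty produces no output rows, and this page does not say how Daft treats empty or missing lists.

```python
from typing import Any, Dict, List


def explode_rows(rows: List[Dict[str, Any]], column: str) -> List[Dict[str, Any]]:
    """Emit one output row per element of row[column], copying other columns."""
    out: List[Dict[str, Any]] = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)        # duplicate every other column
            new_row[column] = element  # replace the list with one element
            out.append(new_row)
    return out


rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": ["c"]},
]
exploded = explode_rows(rows, "tags")
# exploded == [
#     {"id": 1, "tags": "a"},
#     {"id": 1, "tags": "b"},
#     {"id": 2, "tags": "c"},
# ]
```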

Filtering Rows#

daft.DataFrame.distinct

Computes unique rows, dropping duplicates

daft.DataFrame.where

Filters rows via a predicate expression, similar to SQL WHERE.

daft.DataFrame.limit

Limits the rows in the DataFrame to the first N rows, similar to a SQL LIMIT

Reordering#

daft.DataFrame.sort

Sorts the DataFrame globally

daft.DataFrame.repartition

Repartitions the DataFrame into the given number of partitions

Combining#

daft.DataFrame.join

Column-wise join of the current DataFrame with another DataFrame, similar to a SQL JOIN

Aggregations#

daft.DataFrame.groupby

Performs a GroupBy on the DataFrame for aggregation

daft.DataFrame.sum

Performs a global sum on the DataFrame

daft.DataFrame.mean

Performs a global mean on the DataFrame

daft.DataFrame.count

Performs a global count on the DataFrame

daft.DataFrame.min

Performs a global min on the DataFrame

daft.DataFrame.max

Performs a global max on the DataFrame

daft.DataFrame.agg

Performs aggregations on this DataFrame.
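The difference between the global aggregations listed here and an aggregation after a GroupBy can be sketched in plain Python. This is a conceptual illustration only and does not use the Daft API; the sample data and variable names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List

rows: List[Dict[str, object]] = [
    {"team": "a", "points": 3},
    {"team": "b", "points": 5},
    {"team": "a", "points": 4},
]

# Global aggregation: a single result for the whole table.
global_sum = sum(row["points"] for row in rows)

# Grouped aggregation: one result per distinct key value.
totals: Dict[str, int] = defaultdict(int)
for row in rows:
    totals[row["team"]] += row["points"]
grouped_sum = dict(totals)
```

Here `global_sum` is 12, while `grouped_sum` holds one total per team, 7 for "a" and 5 for "b".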

Execution#

Note

These methods execute the operations in your DataFrame and block until execution completes.

Materialization#

daft.DataFrame.collect

Executes the entire DataFrame and materializes the results

Visualization#

daft.DataFrame.show

Executes only enough of the DataFrame to display the first n rows

Writing Data#

daft.DataFrame.write_parquet

Writes the DataFrame as Parquet files, returning a new DataFrame with paths to the files that were written

daft.DataFrame.write_csv

Writes the DataFrame as CSV files, returning a new DataFrame with paths to the files that were written

Integrations#

daft.DataFrame.to_pandas

Converts the current DataFrame to a pandas DataFrame.

daft.DataFrame.to_ray_dataset

Converts the current DataFrame to a Ray Dataset, which is useful for running distributed ML model training in Ray

Schema and Lineage#

daft.DataFrame.explain

Prints the LogicalPlan that will be executed to produce this DataFrame.

daft.DataFrame.schema

Returns the Schema of the DataFrame, which provides information about each column

daft.DataFrame.column_names

Returns the column names of the DataFrame as a list of strings.