daft.DataFrame#

class daft.DataFrame(plan: daft.logical.logical_plan.LogicalPlan)[source]#

A Daft DataFrame is a table of data. It has columns, where each column has a type and the same number of items (rows) as all other columns.

__init__(plan: daft.logical.logical_plan.LogicalPlan) None[source]#

Constructs a DataFrame from a given LogicalPlan. Users should not call this directly; instead, create DataFrames through the classmethods on DataFrame (e.g. read_csv, from_pydict).

Parameters

plan – LogicalPlan describing the steps required to arrive at this DataFrame

Methods

__init__(plan)

Constructs a DataFrame according to a given LogicalPlan

agg(to_agg)

Performs aggregations on this DataFrame

collect([num_preview_rows])

Executes the entire DataFrame and materializes the results

count(*cols)

Performs a global count on the DataFrame

count_rows()

Executes the DataFrame to count the number of rows

distinct()

Computes unique rows, dropping duplicates

exclude(*names)

Drops columns from the current DataFrame by name

explain([show_optimized])

Prints the LogicalPlan that will be executed to produce this DataFrame

explode(*columns)

Explodes a List column, where every element in each row's List becomes its own row, and all other columns in the DataFrame are duplicated across rows
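
The explode semantics can be illustrated in plain Python (Daft not required); the column names here are hypothetical:

```python
# Sketch of explode semantics: each element of the exploded List column
# becomes its own row, and the values of every other column ("id" here)
# are duplicated alongside it.
rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": ["c"]},
]

exploded = [
    {"id": row["id"], "tags": tag}
    for row in rows
    for tag in row["tags"]
]
# [{'id': 1, 'tags': 'a'}, {'id': 1, 'tags': 'b'}, {'id': 2, 'tags': 'c'}]
```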

from_csv(*args, **kwargs)

from_files(path)

Creates a DataFrame of file paths and other metadata from a glob path

from_glob_path(path)

Creates a DataFrame of file paths and other metadata from a glob path

from_json(*args, **kwargs)

from_parquet(*args, **kwargs)

from_pydict(data)

Creates a DataFrame from a Python dictionary

from_pylist(data)

Creates a DataFrame from a list of dictionaries
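
The two construction formats are simple transposes of one another, sketched here in plain Python (Daft not required): from_pydict takes column-oriented data, from_pylist takes row-oriented data.

```python
# Row-oriented data, as from_pylist expects.
pylist = [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}]

# Transpose to column-oriented data, as from_pydict expects.
pydict = {key: [row[key] for row in pylist] for key in pylist[0]}
# {'x': [1, 2], 'y': ['a', 'b']}
```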

groupby(*group_by)

Performs a GroupBy on the DataFrame for aggregation

join(other[, on, left_on, right_on, how])

Column-wise join of the current DataFrame with another DataFrame, similar to a SQL JOIN
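
A plain-Python sketch of the inner-join case, analogous to joining on a shared key column (the "id" column and the data are hypothetical):

```python
# Inner-join semantics: a result row is produced for every pair of
# left/right rows whose key values match, carrying columns from both sides.
left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 2, "score": 10}, {"id": 3, "score": 20}]

joined = [
    {**l, **r}
    for l in left
    for r in right
    if l["id"] == r["id"]
]
# [{'id': 2, 'name': 'b', 'score': 10}]
```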

limit(num)

Limits the rows in the DataFrame to the first N rows, similar to a SQL LIMIT

max(*cols)

Performs a global max on the DataFrame

mean(*cols)

Performs a global mean on the DataFrame

min(*cols)

Performs a global min on the DataFrame

num_partitions()

plan()

Returns the LogicalPlan that will be executed to compute the result of this DataFrame

read_csv(path[, has_headers, column_names, ...])

Creates a DataFrame from CSV file(s)

read_json(path)

Creates a DataFrame from line-delimited JSON file(s)

read_parquet(path)

Creates a DataFrame from Parquet file(s)

repartition(num, *partition_by)

Repartitions the DataFrame into num partitions
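
When partition columns are given, partitioning by hash of the key is one common scheme; the following plain-Python sketch (an illustration, not Daft's internal implementation) shows the key property: rows with the same key always land in the same partition, so downstream per-key operations such as groupby can run partition-locally.

```python
# Hash-partition rows into a fixed number of partitions by the "key" column.
num_partitions = 2
rows = [{"key": "a"}, {"key": "b"}, {"key": "a"}, {"key": "c"}]

partitions = [[] for _ in range(num_partitions)]
for row in rows:
    # Same key value -> same hash -> same partition index.
    partitions[hash(row["key"]) % num_partitions].append(row)
```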

schema()

Returns the Schema of the DataFrame, which provides information about each column

select(*columns)

Creates a new DataFrame from the provided expressions, similar to a SQL SELECT

show([n])

Executes enough of the DataFrame to display the first n rows

sort(by[, desc])

Sorts DataFrame globally

sum(*cols)

Performs a global sum on the DataFrame

to_pandas()

Converts the current DataFrame to a pandas DataFrame

to_pydict()

Converts the current DataFrame to a Python dictionary

to_ray_dataset()

Converts the current DataFrame to a Ray Dataset which is useful for running distributed ML model training in Ray

where(predicate)

Filters rows via a predicate expression, similar to a SQL WHERE

with_column(column_name, expr[, ...])

Adds a column to the current DataFrame with an Expression, equivalent to a select with all current columns and the new one
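
The "select all current columns plus the new one" semantics can be sketched in plain Python on column-oriented data (the column names are hypothetical):

```python
# with_column semantics: the result carries every existing column
# unchanged, plus one new column computed from an expression over
# the existing data.
data = {"x": [1, 2, 3]}

with_doubled = {**data, "x_doubled": [v * 2 for v in data["x"]]}
# {'x': [1, 2, 3], 'x_doubled': [2, 4, 6]}
```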

write_csv(root_dir[, partition_cols])

Writes the DataFrame as CSV files, returning a new DataFrame with paths to the files that were written

write_parquet(root_dir[, compression, ...])

Writes the DataFrame as Parquet files, returning a new DataFrame with paths to the files that were written

Attributes

column_names

Returns the column names of the DataFrame as a list of strings

columns

Returns the columns of the DataFrame as a list of Expressions