API Reference

intake_parquet.source.ParquetSource(*args, …)

Source to load parquet datasets.

class intake_parquet.source.ParquetSource(*args, **kwargs)[source]

Source to load parquet datasets.

Produces a dataframe.

A parquet dataset may be a single file, a set of files in a single directory or a nested set of directories containing data-files.

The implementation uses either fastparquet or pyarrow; select between them with the engine= keyword.

Keyword parameters accepted by this Source:

  • columns: list of str or None

    column names to load. If None, loads all

  • index: str or None

    column to make into the index of the dataframe. If None, the index may be inferred from the saved metadata in certain cases.

  • filters: list of tuples

    row-group level filtering; a tuple like ('x', '>', 1) means that a row-group is skipped if its maximum value for column x is not greater than 1, since no row in it could satisfy the filter. Row-level filtering is not performed.

  • engine: ‘fastparquet’ or ‘pyarrow’

    Which backend to read with.

  • gather_statistics: bool or None (default None)

    Gather the statistics for each dataset partition. By default, this is only done if the _metadata file is available; otherwise, statistics are gathered only if set to True, because the footer of every file must be parsed (which is very slow on some systems).

  • See dask.dataframe.read_parquet() for the other named parameters that can be passed through. A short usage sketch follows this list.
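For example, a minimal usage sketch; the path "data/example.parquet" and the column names "a" and "b" are hypothetical, and the call assumes the usual first argument is the dataset path (urlpath):

    import intake_parquet

    # Hypothetical local dataset; may be a single file, a directory of files,
    # or nested directories of data files.
    source = intake_parquet.source.ParquetSource(
        "data/example.parquet",
        columns=["a", "b"],          # load only these columns (None loads all)
        index=None,                  # or the name of a column to use as the index
        filters=[("a", ">", 1)],     # skip row-groups whose maximum "a" is not > 1
        engine="fastparquet",        # or "pyarrow"
    )

    df = source.read()               # a single pandas DataFrame of the whole dataset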

Attributes
cache_dirs
classname
datashape
description
entry
gui

Source GUI, with parameter selection and plotting

has_been_persisted
hvplot

Returns a hvPlot object to provide a high-level plotting API.

is_persisted
plot

Returns a hvPlot object to provide a high-level plotting API.

plots

List custom associated quick-plots
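The plotting attributes above follow intake's hvplot-based interface. A minimal sketch, assuming hvplot is installed and that the hypothetical dataset has columns "a" and "b":

    import intake_parquet

    source = intake_parquet.source.ParquetSource("data/example.parquet")

    # source.plot and source.hvplot return an hvPlot object; calling it with
    # x/y/kind keywords produces a holoviews object that renders in a notebook.
    scatter = source.plot(x="a", y="b", kind="scatter")

    # In a Jupyter notebook, source.gui displays the parameter-selection and
    # plotting widget described above.
    source.gui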

Methods

  • __call__(**kwargs): Create a new instance of this source with altered arguments
  • close(): Close open resources corresponding to this data source
  • configure_new(**kwargs): Create a new instance of this source with altered arguments
  • discover(): Open the resource and populate the source attributes
  • export(path, **kwargs): Save this data for sharing with other people
  • get(**kwargs): Create a new instance of this source with altered arguments
  • persist([ttl]): Save data from this source to local persistent storage
  • read(): Create a single pandas dataframe from the whole dataset
  • read_chunked(): Return an iterator over container fragments of the data source
  • read_partition(i): Return the part of the data corresponding to the i-th partition
  • to_dask(): Return a dask container for this data source
  • to_spark(): Produce the equivalent Spark DataFrame
  • yaml([with_plugin]): Return a YAML representation of this data-source
  • describe
  • get_persisted
  • set_cache_dir
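A short sketch of the discovery and partition-level methods, using the same hypothetical dataset path; the per-chunk work is illustrative only:

    import intake_parquet

    source = intake_parquet.source.ParquetSource("data/example.parquet")

    info = source.discover()            # open the resource; returns dtype/shape/partition info
    first = source.read_partition(0)    # pandas DataFrame for the first partition only

    for chunk in source.read_chunked(): # iterate over the partitions one at a time
        print(len(chunk))               # illustrative per-chunk work

    print(source.yaml())                # YAML form of this source, usable in a catalog file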

read()[source]

Create a single pandas dataframe from the whole dataset.

to_dask()[source]

Return a dask container for this data source
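As a sketch of the lazy path, again with a hypothetical dataset and numeric column "a":

    import intake_parquet

    source = intake_parquet.source.ParquetSource("data/example.parquet")
    ddf = source.to_dask()              # lazy dask DataFrame; no data is read yet

    print(ddf.npartitions)
    result = ddf["a"].mean().compute()  # reading and computation happen here, in parallel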

to_spark()[source]

Produce the equivalent Spark DataFrame.

This will ignore all arguments except the urlpath, which is passed directly to Spark to interpret. If you need to configure the storage, that must be done on the Spark side.

This method requires intake-spark. See its documentation for how to set up a Spark session.
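A sketch of the Spark path, assuming intake-spark is installed, a Spark session has already been set up as its documentation describes, and "a" is a hypothetical column:

    import intake_parquet

    source = intake_parquet.source.ParquetSource("data/example.parquet")
    sdf = source.to_spark()             # pyspark.sql.DataFrame; only the urlpath is passed to Spark

    sdf.printSchema()
    sdf.groupBy("a").count().show()     # ordinary PySpark operations from here on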