API Reference

class intake_parquet.source.ParquetSource(*args, **kwargs)

Source to load parquet datasets.

Produces a dataframe.

A parquet dataset may be a single file, a set of files in a single directory, or a nested set of directories containing data files.

The implementation uses either fastparquet or pyarrow; select between them with the engine= keyword argument.

Common keyword parameters accepted by this Source (a usage sketch follows this list):

  • columns: list of str or None

    column names to load. If None, loads all

  • filters: list of tuples

    row-group-level filtering; a tuple like ('x', '>', 1) means that a row-group whose maximum value for column x is less than 1 will be skipped. Row-level filtering is not performed.

  • engine: ‘fastparquet’ or ‘pyarrow’

    Which backend to read with.

  • see pd.read_parquet() and dd.read_parquet() for the other named parameters that can be passed through.
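
For orientation, a minimal sketch of opening a dataset with these keyword parameters; it assumes intake-parquet is installed (which registers the driver as intake.open_parquet) and uses a purely hypothetical path and column names:

    import intake

    # Hypothetical dataset; any single file, flat directory or nested
    # directory layout of parquet data files would work as the urlpath.
    source = intake.open_parquet(
        "data/trips.parquet",
        columns=["passenger_count", "fare_amount"],  # load only these columns
        filters=[("fare_amount", ">", 0)],           # skip row-groups whose max fare_amount is below 0
        engine="fastparquet",                        # or "pyarrow"
    )

    # The source is lazy: nothing is read until read(), to_dask(), etc. is called.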

Attributes

  • cache
  • cache_dirs
  • cat
  • classname
  • description
  • dtype
  • entry
  • gui: Source GUI, with parameter selection and plotting
  • has_been_persisted
  • hvplot: Returns an hvPlot object to provide a high-level plotting API.
  • is_persisted
  • plot: Returns an hvPlot object to provide a high-level plotting API.
  • plots: List custom associated quick-plots
  • shape

Methods

  • __call__(**kwargs): Create a new instance of this source with altered arguments
  • close(): Close open resources corresponding to this data source.
  • configure_new(**kwargs): Create a new instance of this source with altered arguments
  • describe(): Description from the entry spec
  • discover(): Open resource and populate the source attributes.
  • export(path, **kwargs): Save this data for sharing with other people
  • get(**kwargs): Create a new instance of this source with altered arguments
  • persist([ttl]): Save data from this source to local persistent storage
  • read(): Create a single pandas dataframe from the whole data-set
  • read_chunked(): Return iterator over container fragments of data source
  • read_partition(i): Return a part of the data corresponding to the i-th partition.
  • to_dask(): Return a dask container for this data source
  • to_spark(): Produce Spark DataFrame equivalent
  • yaml(): Return YAML representation of this data-source
  • get_persisted
  • set_cache_dir
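
To make the method list above concrete, here is a hedged sketch of the partition-oriented calls, reusing the hypothetical source object from the earlier example (the dictionary keys shown for discover() follow the usual intake convention and are an assumption here):

    # Populate dtype, shape and npartitions without loading the data
    info = source.discover()
    print(info["npartitions"], info["dtype"])

    # Read a single partition as a pandas DataFrame
    part0 = source.read_partition(0)

    # Stream through the dataset one fragment at a time
    for chunk in source.read_chunked():
        print(len(chunk))

    # YAML spec that can be pasted into an intake catalog file
    print(source.yaml())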

read()

Create a single pandas dataframe from the whole data-set
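
A short illustration, reusing the hypothetical source from above; note that read() materialises the entire (column- and filter-restricted) dataset in memory:

    df = source.read()
    print(type(df))   # pandas DataFrame
    print(len(df))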

to_dask()

Return a dask container for this data source
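
A brief sketch of the lazy alternative, again assuming the hypothetical fare_amount column from the earlier example:

    # Build a dask DataFrame; work is deferred until compute() is called
    ddf = source.to_dask()
    print(ddf.npartitions)
    print(ddf["fare_amount"].mean().compute())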

to_spark()

Produce Spark DataFrame equivalent

This will ignore all arguments except the urlpath, which will be interpreted directly by Spark. If you need to configure the storage, that must be done on the Spark side.

This method requires intake-spark; see its documentation for how to set up a Spark session.
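
A hedged sketch only, since it assumes intake-spark is installed and a Spark session has already been configured as its documentation describes:

    # Only the urlpath is handed to Spark; storage configuration happens on the Spark side
    sdf = source.to_spark()
    sdf.printSchema()
    print(sdf.count())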