API Reference

intake_avro.source.AvroTableSource(urlpath) Source to load tabular Avro datasets.
intake_avro.source.AvroSequenceSource(urlpath) Source to load Avro datasets as sequence of Python dicts.
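
If intake-avro is installed, these sources are typically also reachable through Intake's top-level convenience functions. A minimal sketch, assuming the drivers are registered under the names implied by the classes; the glob path is hypothetical:

    import intake

    # Assumes the "avro_table" driver registered by intake-avro;
    # the glob path is hypothetical.
    source = intake.open_avro_table("data/records_*.avro")
    df = source.read()  # pandas DataFrame with all records
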
class intake_avro.source.AvroTableSource(urlpath, blocksize=100000000, metadata=None, storage_options=None)

Source to load tabular Avro datasets.

Parameters:
urlpath: str

Location of the data files; can include protocol and glob characters.

blocksize: int or None

Partition the input files by roughly this number of bytes. Actual partition sizes will depend on the inherent structure of the data files. If None, each input file will be one partition and no file scanning will be needed ahead of time.

storage_options: dict or None

Parameters to pass on to the file-system backend.
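
A minimal usage sketch, assuming intake-avro is installed; the glob path and blocksize shown here are illustrative:

    from intake_avro.source import AvroTableSource

    # The local glob path is hypothetical; remote protocols such as
    # "s3://bucket/data/*.avro" also work, with credentials passed
    # via storage_options.
    source = AvroTableSource("data/records_*.avro", blocksize=50_000_000)
    source.discover()   # populate schema and partition information
    df = source.read()  # load everything into a pandas DataFrame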

Attributes:
cache_dirs
datashape
description
hvplot

Returns an hvPlot object to provide a high-level plotting API.

plot

Returns an hvPlot object to provide a high-level plotting API.

plots

List custom associated quick-plots.

Methods

close() Close open resources corresponding to this data source.
discover() Open the resource and populate the source attributes.
read() Load the entire dataset into a container and return it.
read_chunked() Return an iterator over container fragments of the data source.
read_partition(i) Return the part of the data corresponding to the i-th partition.
to_dask() Create a lazy Dask dataframe object.
to_spark() Pass the URL to Spark to load as a DataFrame.
yaml([with_plugin]) Return a YAML representation of this data source.
set_cache_dir
read()

Load the entire dataset into a container and return it.
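
For example, under the same assumptions as above (hypothetical path, valid Avro files), read() concatenates every partition into one pandas DataFrame:

    from intake_avro.source import AvroTableSource

    source = AvroTableSource("data/records_*.avro")  # hypothetical path
    df = source.read()  # one pandas DataFrame covering all files
    print(df.head())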

to_dask()

Create a lazy Dask dataframe object.
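
A sketch of deferred processing with the resulting dataframe; the path and column names are hypothetical:

    from intake_avro.source import AvroTableSource

    source = AvroTableSource("data/records_*.avro")  # hypothetical path
    ddf = source.to_dask()  # lazy dask.dataframe; nothing is read yet
    # "key" and "value" are hypothetical column names.
    result = ddf.groupby("key")["value"].mean().compute()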

to_spark()

Pass the URL to Spark to load as a DataFrame.

Note that this requires org.apache.spark.sql.avro.AvroFileFormat to be available on your Spark classpath.

This feature is experimental.
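
A sketch only, given the experimental status: it assumes a working PySpark installation and the Avro file format on the Spark classpath; the path is hypothetical:

    from intake_avro.source import AvroTableSource

    source = AvroTableSource("data/records_*.avro")  # hypothetical path
    sdf = source.to_spark()  # pyspark.sql.DataFrame via Spark's Avro reader
    sdf.printSchema()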

class intake_avro.source.AvroSequenceSource(urlpath, blocksize=100000000, metadata=None, storage_options=None)

Source to load Avro datasets as sequence of Python dicts.

Parameters:
urlpath: str

Location of the data files; can include protocol and glob characters.

blocksize: int or None

Partition the input files by roughly this number of bytes. Actual partition sizes will depend on the inherent structure of the data files. If None, each input file will be one partition and no file scanning will be needed ahead of time.

storage_options: dict or None

Parameters to pass on to the file-system backend.
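
A minimal usage sketch, assuming intake-avro is installed; the glob path is hypothetical:

    from intake_avro.source import AvroSequenceSource

    # blocksize=None makes each input file one partition, so no file
    # scanning is needed ahead of time.
    source = AvroSequenceSource("data/events_*.avro", blocksize=None)
    records = source.read()  # list of Python dicts, one per Avro record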

Attributes:
cache_dirs
datashape
description
hvplot

Returns an hvPlot object to provide a high-level plotting API.

plot

Returns an hvPlot object to provide a high-level plotting API.

plots

List custom associated quick-plots.

Methods

close() Close open resources corresponding to this data source.
discover() Open the resource and populate the source attributes.
read() Load the entire dataset into a container and return it.
read_chunked() Return an iterator over container fragments of the data source.
read_partition(i) Return the part of the data corresponding to the i-th partition.
to_dask() Create a lazy Dask bag object.
to_spark() Provide an equivalent data object in Apache Spark.
yaml([with_plugin]) Return a YAML representation of this data source.
set_cache_dir
read()

Load the entire dataset into a container and return it.
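
For example, with a hypothetical path and a hypothetical "name" field in the records:

    from intake_avro.source import AvroSequenceSource

    source = AvroSequenceSource("data/events_*.avro")  # hypothetical path
    records = source.read()               # all files, as a list of dicts
    names = [r["name"] for r in records]  # "name" is a hypothetical field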

to_dask()

Create a lazy Dask bag object.
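
A sketch of lazy processing with the resulting bag; the path and the "ok" field are hypothetical:

    from intake_avro.source import AvroSequenceSource

    source = AvroSequenceSource("data/events_*.avro")  # hypothetical path
    bag = source.to_dask()  # dask.bag.Bag of dicts; nothing is read yet
    count = bag.filter(lambda r: r.get("ok")).count().compute()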