Advanced filesystem usage
The filesystem source provides the building blocks to load data from files. This section explains how you can customize the filesystem source for your use case.
Standalone filesystem resource
You can use the standalone filesystem resource to list files in cloud storage or a local filesystem. This allows you to customize file readers or manage files using fsspec.
import dlt
from dlt.sources.filesystem import filesystem

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
# List the CSV files under the bucket URL; file content is not loaded by default.
files = filesystem(bucket_url="s3://my_bucket/data", file_glob="csv_folder/*.csv")
pipeline.run(files)
The filesystem resource ensures consistent file representation across bucket types and offers methods to access and read data. You can quickly build pipelines to:
- Extract text from PDFs (unstructured data source).
- Stream large file content directly from buckets.
- Copy files locally.
FileItem representation
- All dlt sources/resources that yield files follow the FileItem contract.
- File content is typically not loaded (you can control it with the extract_content parameter of the filesystem resource). Instead, full file info and methods to access content are available.
- Users can request an authenticated fsspec AbstractFileSystem instance.
FileItem fields
- file_url - complete URL of the file (e.g., s3://bucket-name/path/file). This field serves as a primary key.
- file_name - name of the file from the bucket URL.
- relative_path - set when globbing; the file's path relative to the bucket_url argument.
- mime_type - file's MIME type. It is sourced from the bucket provider or inferred from its extension.
- modification_date - file's last modification time (format: pendulum.DateTime).
- size_in_bytes - file size.
- file_content - content, provided upon request.
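To make the field list concrete, here is a minimal sketch that builds a dictionary with the same keys for a local file, using only the Python standard library. It is an illustration, not dlt's implementation: FileItemSketch and describe_local_file are hypothetical names, datetime stands in for pendulum.DateTime, and taking file_name as the base name of the path is an assumption.

```python
import mimetypes
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional, TypedDict


class FileItemSketch(TypedDict):
    """Hypothetical stand-in that mirrors the FileItem fields listed above."""

    file_url: str
    file_name: str
    relative_path: str
    mime_type: str
    modification_date: datetime  # dlt documents pendulum.DateTime here
    size_in_bytes: int
    file_content: Optional[bytes]


def describe_local_file(
    bucket_dir: Path, path: Path, extract_content: bool = False
) -> FileItemSketch:
    """Describe one local file; content is read only when requested."""
    stat = path.stat()
    # Infer the MIME type from the extension, with a generic fallback.
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "file_url": path.resolve().as_uri(),
        "file_name": path.name,  # assumption: base name only
        "relative_path": path.relative_to(bucket_dir).as_posix(),
        "mime_type": mime_type or "application/octet-stream",
        "modification_date": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
        "size_in_bytes": stat.st_size,
        "file_content": path.read_bytes() if extract_content else None,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bucket = Path(tmp)
        sample = bucket / "csv_folder" / "a.csv"
        sample.parent.mkdir()
        sample.write_bytes(b"id,name\n1,x\n")
        for key, value in describe_local_file(bucket, sample).items():
            print(f"{key}: {value}")
```

Leaving file_content empty unless extract_content is set mirrors the behavior described above: the metadata is cheap to collect, while reading the content can be deferred until it is needed.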
When using a nested or recursive glob pattern, relative_path will include the file's path relative to bucket_url. For instance, using the resource: filesystem("az://dlt-ci-test-bucket/standard_source/samples", file_glob="met_csv/A801/*.csv") will produce file names relative to the /standard_source/samples path, such as met_csv/A801/A881_20230920.csv. For local filesystems, POSIX paths (using "/" as separator) are returned.
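The relationship between the glob pattern and the resulting relative paths can be sketched for a local directory with the standard library alone. This is an illustration of the idea, not the code dlt runs: glob_relative_paths is a hypothetical helper, and the temporary directory stands in for the bucket_url location.

```python
import tempfile
from pathlib import Path
from typing import List


def glob_relative_paths(bucket_dir: Path, file_glob: str) -> List[str]:
    """Return POSIX-style paths, relative to bucket_dir, of files matching file_glob."""
    return sorted(
        p.relative_to(bucket_dir).as_posix()  # always "/" as separator
        for p in bucket_dir.glob(file_glob)
        if p.is_file()
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bucket = Path(tmp)
        target = bucket / "met_csv" / "A801" / "A881_20230920.csv"
        target.parent.mkdir(parents=True)
        target.write_text("a,b\n")
        # A nested pattern keeps the folder structure in the result.
        print(glob_relative_paths(bucket, "met_csv/A801/*.csv"))
        # A recursive pattern finds the same file at any depth.
        print(glob_relative_paths(bucket, "**/*.csv"))
```

Converting with as_posix() keeps the separator as "/" on every platform, which matches the note above about local filesystems.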