How is it intended to do multi-threading with this library? #7284

Open
jonded94 opened this issue Mar 13, 2025 · 5 comments
Labels
question Further information is requested

Comments

@jonded94
Contributor

jonded94 commented Mar 13, 2025

Which part is this question about
Library API / UX

Describe your question
Many functions in the pyarrow package already implement multithreading for you.
As far as I understand, reading from a file, for example, is multithreaded by having each column processed by a separate thread.

As far as I know, there is nothing comparable directly available in this Rust crate. What are users expected to do here?

For example, to simply read a parquet file in a parallelized manner, would one do something like this?

  1. first look at the schema to find out what the columns are
  2. spawn an async worker for each column that reads from the same file, but with a filter for just one column
  3. collect all RecordBatches from each worker and merge them into one RecordBatch containing all the data

If this crate doesn't already offer something that does this, should it maybe become part of this crate?

How could parallelized writes work? It's not easily possible to just write parquet files containing one column each and then merge them afterwards, right?

@jonded94 jonded94 added the question Further information is requested label Mar 13, 2025
@tustvold
Contributor

tustvold commented Mar 13, 2025

As arrow-rs is designed to be embedded in various different environments, it makes no assumptions about runtime environment, instead providing the raw primitives for people to use as appropriate.

For reading parquet, row groups can be decoded in parallel. See https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.new_with_metadata

For writing parquet see https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html

These can be used with a threadpool like rayon or similar.

If you'd prefer a more batteries-included experience, I would recommend looking at a fully fledged query engine, such as DataFusion, which wires these primitives up for you.

> spawn an async worker for each column that reads from the same file, but with a filter for just one column

This is relatively non-trivial because parquet and arrow represent nested data differently. However, one could use projections to process disjoint sets of columns in parallel; processing row groups in parallel is the more common approach.

@jonded94
Contributor Author

Thank you a lot @tustvold, this is very helpful! I'll have a look into all of this; I wasn't aware that it's possible to split up writing columns across individual workers.

@alamb
Contributor

alamb commented Mar 14, 2025

DataFusion's writer supports multi-threaded writing of parquet files, FWIW.

It also reads parquet files in parallel.

@jonded94
Contributor Author

@alamb where would I find this exactly?

I found this for example: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.write_table

But I'd basically need something like a "PathBuf" -> "RecordBatch" interface, and vice versa.

@alamb
Contributor

alamb commented Mar 15, 2025

> @alamb where would I find this exactly?
>
> I found this for example: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.write_table
>
> But I'd need something like a "PathBuf" -> "RecordBatch" and vice versa interface basically.

What are you trying to do?

Perhaps this is what you are looking for? https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.write_parquet
