Chunking

Xorbits divides large datasets into multiple chunks, and each chunk is processed independently by a single-node library such as pandas or numpy. Chunking has a significant impact on performance. Too many chunks produce a large computation graph, and the supervisor spends excessive time on scheduling. Too few chunks mean larger chunks, and a chunk that exceeds the memory capacity of its worker can cause OOM (Out-Of-Memory) errors. A chunk should therefore be neither too large nor too small, and the chunking needs to match both the computation and the available storage. Users familiar with Dask know that Dask requires chunk shapes or sizes to be set manually through operations such as repartition().

Automatically

Unlike Dask, Xorbits does not require users to set chunk sizes manually or to call repartition(), because chunking happens automatically in the background and is transparent to the user. This automatic chunking simplifies the user interface (no extra repartition code) and improves performance (fewer OOM issues caused by oversized chunks). We call this process Dynamic Tiling. Interested readers can refer to our research paper for more details.

In Xorbits, partitioning an operator is referred to as tiling. A predefined option called chunk_store_limit controls the upper limit on the size of each chunk. During tiling, Xorbits calculates the size of the data coming from upstream operators and keeps each chunk's data size at or below chunk_store_limit. Any data beyond that limit is placed in a new chunk.
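The idea of size-bounded partitioning can be sketched in plain Python. This is an illustrative sketch only, not Xorbits' actual tiling code: the helper name split_rows_by_limit and the assumption that every row occupies the same estimated number of bytes are hypothetical.

```python
from typing import List, Tuple


def split_rows_by_limit(
    n_rows: int, bytes_per_row: int, limit_bytes: int
) -> List[Tuple[int, int]]:
    """Return half-open (start, stop) row ranges sized to fit limit_bytes.

    Hypothetical helper for illustration; it assumes a uniform row size.
    """
    if bytes_per_row <= 0 or limit_bytes <= 0:
        raise ValueError("bytes_per_row and limit_bytes must be positive")
    # Keep at least one row per chunk, even if a single row exceeds the limit.
    rows_per_chunk = max(1, limit_bytes // bytes_per_row)
    return [
        (start, min(start + rows_per_chunk, n_rows))
        for start in range(0, n_rows, rows_per_chunk)
    ]


# 10 million rows at an estimated 80 bytes each, with a 512 MiB limit.
ranges = split_rows_by_limit(
    n_rows=10_000_000, bytes_per_row=80, limit_bytes=512 * 1024 ** 2
)
print(ranges)
```

Rows that do not fit under the limit spill into the next range, which mirrors the "exceeding data goes into a new chunk" behaviour described above.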

By default, chunk_store_limit is set to 512 * 1024 ** 2 bytes, which is 512 MiB. This value may not be optimal for every scenario and workload. In CPU environments, raising it may not yield substantial benefits even if a large amount of RAM is available. In GPU scenarios, however, it is advisable to raise it so that each chunk holds as much data as possible, which minimizes data transfer between GPUs.
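To get a feel for what the limit means in practice, the arithmetic below estimates the minimum number of chunks needed for a dataset of a given size. It is a back-of-the-envelope sketch: the helper min_chunk_count is hypothetical, and the real chunk count depends on how Xorbits tiles each operator.

```python
DEFAULT_CHUNK_STORE_LIMIT = 512 * 1024 ** 2  # bytes, the value quoted above


def min_chunk_count(total_bytes: int, limit_bytes: int = DEFAULT_CHUNK_STORE_LIMIT) -> int:
    """Smallest number of chunks such that no chunk exceeds limit_bytes.

    Hypothetical helper for illustration, not part of the Xorbits API.
    """
    # Integer ceiling division avoids floating-point rounding.
    return (total_bytes + limit_bytes - 1) // limit_bytes


ten_gib = 10 * 1024 ** 3
print(DEFAULT_CHUNK_STORE_LIMIT)              # 536870912
print(min_chunk_count(ten_gib))               # 20
print(min_chunk_count(ten_gib, 1024 ** 3))    # 10
```

Doubling the limit halves the lower bound on the chunk count, which is the trade-off between scheduling overhead and per-chunk memory use discussed at the top of this page.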

You can set this value for a limited scope with a context manager: xorbits.pandas.option_context({"chunk_store_limit": 1024 ** 3}).

import xorbits.pandas as xpd

with xpd.option_context({"chunk_store_limit": 1024 ** 3}):
    # your xorbits code goes here; the option is restored when the block exits
    pass

Or you can set this value at the beginning of your Python script: xorbits.pandas.set_option("chunk_store_limit", 1024 ** 3)

import xorbits.pandas as xpd

xpd.set_option("chunk_store_limit", 1024 ** 3)
# your xorbits code

Manually

We recommend configuring chunking with xorbits.pandas.option_context() or xorbits.pandas.set_option(), as shown above. If you wish to control the number of chunks yourself (typically for debugging purposes), you can do so by specifying chunk_size when creating a Xorbits DataFrame or array.

import numpy as np
import pandas as pd
import xorbits.numpy as xnp
import xorbits.pandas as xpd

a = xnp.ones((100, 100), chunk_size=30)

data = pd.DataFrame(
    np.random.rand(10, 10), index=np.arange(10), columns=np.arange(3, 13)
)
xdf = xpd.DataFrame(data, chunk_size=5)
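The chunk layout implied by these calls can be worked out by hand. The sketch below assumes that an integer chunk_size is applied along every axis, with a smaller remainder chunk at the end of an axis when the length is not a multiple of chunk_size. The helpers axis_splits and chunk_grid are hypothetical and only reproduce the arithmetic; they do not call Xorbits.

```python
from typing import List, Sequence


def axis_splits(length: int, chunk_size: int) -> List[int]:
    """Split one axis into pieces of chunk_size plus an optional remainder."""
    full, rest = divmod(length, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def chunk_grid(shape: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Per-axis splits, assuming the same integer chunk_size on every axis."""
    return [axis_splits(n, chunk_size) for n in shape]


def count_chunks(grid: List[List[int]]) -> int:
    """Total number of chunks is the product of the per-axis piece counts."""
    total = 1
    for splits in grid:
        total *= len(splits)
    return total


array_grid = chunk_grid((100, 100), 30)  # shape and chunk_size of `a` above
frame_grid = chunk_grid((10, 10), 5)     # shape and chunk_size of `xdf` above
print(array_grid, count_chunks(array_grid))
print(frame_grid, count_chunks(frame_grid))
```

Under that assumption, the 100 x 100 array is split into a 4 x 4 grid of chunks and the 10 x 10 DataFrame into a 2 x 2 grid.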
