Chunking#
Xorbits divides large datasets into multiple chunks, with each chunk executed independently using
single-node libraries such as pandas and numpy. Chunking significantly impacts performance. Too
many chunks can lead to a large computation graph, causing the supervisor to spend excessive time
on scheduling. Conversely, too few chunks may result in OOM (Out-Of-Memory) issues for some chunks
that exceed memory capacity. Therefore, a single chunk should not be too large or too small, and
chunking needs to align with both the computation and the available storage. Users familiar with
Dask know that Dask requires manual setting of chunk shape or sizes using certain operations, such as repartition().
Automatically#
Unlike Dask, Xorbits does not require users to manually set chunk sizes or perform repartition()
operations, as our chunking process occurs automatically in the background, transparent to the user.
This automatic chunking mechanism simplifies user interfaces (no more extra repartition code) and
optimizes performance (no more OOM issues). We call this process Dynamic Tiling. Interested
readers can refer to our research paper for more detailed
information.
Xorbits’ operator partitioning is referred to as tiling. We have a predefined option called
chunk_store_limit. This option controls the upper limit of each chunk. During the tiling
process, Xorbits calculates the data size incoming from upstream operators. Each chunk’s data size
is <= the chunk_store_limit. Any data exceeding the chunk_store_limit is
partitioned into a new chunk.
We have set this chunk_store_limit option to 512 * 1024 ** 2, which is equivalent to
512 M. It’s important to note that this value may not be optimal for all scenarios and workloads.
In CPU environments, setting this value higher may not yield substantial benefits even if you
have a large amount of RAM available. However, in GPU scenarios, it’s advisable to set this value
higher to maximize the data size within each chunk, thereby minimizing data transfer between GPUs.
You can set this value with a context: xorbits.pandas.option_context({"chunk_store_limit": 1024 ** 3}).
import xorbits.pandas as xpd
with xpd.option_context({"chunk_store_limit": 1024 ** 3}):
# your xorbits code
Or you can set this value at the begining of your Python script: xorbits.pandas.set_option({"chunk_store_limit": 1024 ** 3})
import xorbits.pandas as xpd
xpd.set_option("chunk_store_limit", 1024 ** 3)
# your xorbits code
Manually#
We recommend using either the xorbits.option_context() method or the xorbits.options
attribute mentioned above to configure the setting. If you wish to specify the number of chunks
(typically for debugging purposes), you can do so as follows by specifying the chunk_size
when creating a Xorbits DataFrame or Array.
import numpy as np
import pandas as pd
import xorbits.numpy as xnp
import xorbits.pandas as xpd
a = xnp.ones((100, 100), chunk_size=30)
data = pd.DataFrame(
np.random.rand(10, 10), index=np.arange(10), columns=np.arange(3, 13)
)
xdf = xpd.DataFrame(data, chunk_size=5)