.. _chunking: ======== Chunking ======== Xorbits divides large datasets into multiple chunks, with each chunk executed independently using single-node libraries such as pandas and numpy. Chunking significantly impacts performance. Too many chunks can lead to a large computation graph, causing the supervisor to spend excessive time on scheduling. Conversely, too few chunks may result in OOM (Out-Of-Memory) issues for some chunks that exceed memory capacity. Therefore, a single chunk should not be too large or too small, and chunking needs to align with both the computation and the available storage. Users familiar with Dask know that Dask requires manual setting of chunk shape or sizes using certain operations, such as :code:`repartition()`. Automatically ------------- Unlike Dask, Xorbits does not require users to manually set chunk sizes or perform :code:`repartition()` operations, as our chunking process occurs automatically in the background, transparent to the user. This automatic chunking mechanism simplifies user interfaces (no more extra :code:`repartition` code) and optimizes performance (no more OOM issues). We call this process **Dynamic Tiling**. Interested readers can refer to our `research paper `_ for more detailed information. Xorbits' operator partitioning is referred to as tiling. We have a predefined option called :code:`chunk_store_limit`. This option controls the upper limit of each chunk. During the tiling process, Xorbits calculates the data size incoming from upstream operators. Each chunk's data size is \<= the :code:`chunk_store_limit`. Any data exceeding the :code:`chunk_store_limit` is partitioned into a new chunk. We have set this :code:`chunk_store_limit` option to :code:`512 * 1024 ** 2`, which is equivalent to 512 M. It's important to note that this value may not be optimal for all scenarios and workloads. In CPU environments, setting this value higher may not yield substantial benefits even if you have a large amount of RAM available. However, in GPU scenarios, it's advisable to set this value higher to maximize the data size within each chunk, thereby minimizing data transfer between GPUs. You can set this value with a context: :code:`xorbits.pandas.option_context({"chunk_store_limit": 1024 ** 3})`. .. code-block:: python import xorbits.pandas as xpd with xpd.option_context({"chunk_store_limit": 1024 ** 3}): # your xorbits code Or you can set this value at the begining of your Python script: :code:`xorbits.pandas.set_option({"chunk_store_limit": 1024 ** 3})` .. code-block:: python import xorbits.pandas as xpd xpd.set_option("chunk_store_limit", 1024 ** 3) # your xorbits code Manually -------- We recommend using either the :code:`xorbits.option_context()` method or the :code:`xorbits.options` attribute mentioned above to configure the setting. If you wish to specify the number of chunks (typically for debugging purposes), you can do so as follows by specifying the :code:`chunk_size` when creating a Xorbits DataFrame or Array. .. code-block:: python import numpy as np import pandas as pd import xorbits.numpy as xnp import xorbits.pandas as xpd a = xnp.ones((100, 100), chunk_size=30) data = pd.DataFrame( np.random.rand(10, 10), index=np.arange(10), columns=np.arange(3, 13) ) xdf = xpd.DataFrame(data, chunk_size=5)