{
"cells": [
{
"cell_type": "markdown",
"id": "ccf4ea1c",
"metadata": {},
"source": [
"# Dask\n",
"\n",
"Dask performs two different tasks:\n",
"1. it optimizes dynamic task scheduling, similar to [Airflow](https://airflow.apache.org/), [Luigi](https://github.com/spotify/luigi) or [Celery](https://docs.celeryq.dev/en/stable/).\n",
"2. it performs parallel data like arrays, dataframes, and lists with dynamic task scheduling."
]
},
{
"cell_type": "markdown",
"id": "d6041d52",
"metadata": {},
"source": [
"## Scales from laptops to clusters\n",
"\n",
"Dask can be installed on a laptop with uv and expands the size of the datasets from *fits in memory* to *fits on disk*. Dask can also scale to a cluster of hundreds of machines. It is resilient, elastic, data-local and has low latency. For more information, see the [distributed scheduler](https://distributed.dask.org/en/latest/) documentation. This simple transition between a single machine and a cluster allows users to both start easily and grow as needed."
]
},
{
"cell_type": "markdown",
"id": "4e332597",
"metadata": {},
"source": [
"## Install Dask\n",
"\n",
"You can install everything that is required for most common applications of Dask (arrays, dataframes, …). This installs both Dask and dependencies such as NumPy, Pandas, etc. that are required for various workloads:\n",
"\n",
"``` bash\n",
"$ uv add \"dask[complete]\"\n",
"```\n",
"\n",
"However, only individual subsets can be installed with:\n",
"\n",
"``` bash\n",
"$ uv add \"dask[array]\"\n",
"$ uv add \"dask[dataframe]\"\n",
"$ uv add \"dask[diagnostics]\"\n",
"$ uv add \"dask[distributed]\"\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "cb34d835",
"metadata": {},
"source": [
"### Testing the installation"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6d1b3a12",
"metadata": {},
"outputs": [],
"source": [
"import pytest"
]
},
{
"cell_type": "markdown",
"id": "7d944110",
"metadata": {},
"source": [
"## Familiar operation"
]
},
{
"cell_type": "markdown",
"id": "5acd42b5",
"metadata": {},
"source": [
"### Dask DataFrame\n",
"\n",
"… imitates pandas."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cfafa039-413d-46ed-ad61-2fb8beb7a4c5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" 2021-12 | \n",
" 2022-01 | \n",
" 2022-02 | \n",
"
\n",
" \n",
" | Title | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | Jupyter Tutorial | \n",
" 0.5 | \n",
" 18103.5 | \n",
" 20505.5 | \n",
" 13099.0 | \n",
"
\n",
" \n",
" | PyViz Tutorial | \n",
" 2.0 | \n",
" 4873.0 | \n",
" 3930.0 | \n",
" 2573.0 | \n",
"
\n",
" \n",
" | Python Basics | \n",
" 4.5 | \n",
" 261.0 | \n",
" 251.0 | \n",
" 341.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unnamed: 0 2021-12 2022-01 2022-02\n",
"Title \n",
"Jupyter Tutorial 0.5 18103.5 20505.5 13099.0\n",
"PyViz Tutorial 2.0 4873.0 3930.0 2573.0\n",
"Python Basics 4.5 261.0 251.0 341.0"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"df = pd.read_csv(\"tutorials.csv\")\n",
"grouped = df.groupby(\"Title\")\n",
"grouped.agg(\"mean\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ebefec46",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" 2021-12 | \n",
" 2022-01 | \n",
" 2022-02 | \n",
"
\n",
" \n",
" | Title | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | Jupyter Tutorial | \n",
" 0.5 | \n",
" 18103.5 | \n",
" 20505.5 | \n",
" 13099.0 | \n",
"
\n",
" \n",
" | PyViz Tutorial | \n",
" 2.0 | \n",
" 4873.0 | \n",
" 3930.0 | \n",
" 2573.0 | \n",
"
\n",
" \n",
" | Python Basics | \n",
" 4.5 | \n",
" 261.0 | \n",
" 251.0 | \n",
" 341.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unnamed: 0 2021-12 2022-01 2022-02\n",
"Title \n",
"Jupyter Tutorial 0.5 18103.5 20505.5 13099.0\n",
"PyViz Tutorial 2.0 4873.0 3930.0 2573.0\n",
"Python Basics 4.5 261.0 251.0 341.0"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import dask.dataframe as dd\n",
"\n",
"\n",
"dd = pd.read_csv(\"tutorials.csv\")\n",
"dd.groupby(\"Title\").agg(\"mean\")"
]
},
{
"cell_type": "markdown",
"id": "e23f8eee",
"metadata": {},
"source": [
"\n",
"\n",
"**See also:**\n",
"\n",
"* [Dask DataFrame Docs](https://docs.dask.org/en/latest/dataframe.html)\n",
"* [Dask DataFrame Best Practices](https://docs.dask.org/en/latest/dataframe-best-practices.html)\n",
"
"
]
},
{
"cell_type": "markdown",
"id": "bd4cdd77",
"metadata": {},
"source": [
"### Dask Array\n",
"\n",
"… imitates NumPy."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "7c11efbb",
"metadata": {},
"outputs": [],
"source": [
"import h5py\n",
"import numpy as np\n",
"\n",
"\n",
"f = h5py.File(\"mydata.h5\")\n",
"x = np.array(f[\".\"])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "c1b99ffd",
"metadata": {},
"outputs": [],
"source": [
"import dask.array as da\n",
"\n",
"\n",
"f = h5py.File(\"mydata.h5\")\n",
"x = da.array(f[\".\"])"
]
},
{
"cell_type": "markdown",
"id": "530d413b",
"metadata": {},
"source": [
"\n",
"\n",
"**See also:**\n",
"\n",
"* [Dask Array Docs](https://docs.dask.org/en/latest/array.html)\n",
"* [Dask Array Best Practices](https://docs.dask.org/en/latest/array-best-practices.html)\n",
"
"
]
},
{
"cell_type": "markdown",
"id": "601abdeb",
"metadata": {},
"source": [
"### Dask Bag\n",
"\n",
"… imitates [iterators](https://docs.python.org/3/library/itertools.html), [Toolz](https://toolz.readthedocs.io/en/latest/index.html) und [PySpark](https://spark.apache.org/docs/latest/api/python/)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "023e30e1-90c6-4ebc-aeef-3bd286ac29ad",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[11, 10]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import dask.bag as db\n",
"\n",
"\n",
"b = db.from_sequence([10, 3, 5, 7, 11, 4])\n",
"list(b.topk(2))"
]
},
{
"cell_type": "markdown",
"id": "f3f50983",
"metadata": {},
"source": [
"\n",
"\n",
"**See also:**\n",
"\n",
"* [Dask Bag Docs](https://docs.dask.org/en/latest/bag.html)\n",
"
"
]
},
{
"cell_type": "markdown",
"id": "b1b1b4d3",
"metadata": {},
"source": [
"### Dask Delayed\n",
"\n",
"… imitates loops and wraps custom code, see [Creating a delayed pipeline](../clean-prep/dask-pipeline.ipynb#5.-Creating-a-delayed-pipeline)."
]
},
{
"cell_type": "markdown",
"id": "20d967a4",
"metadata": {},
"source": [
"\n",
"\n",
"**See also:**\n",
"\n",
"* [Dask Delayed Docs](https://docs.dask.org/en/latest/delayed.html)\n",
"* [Dask Delayed Best Practices](https://docs.dask.org/en/latest/delayed-best-practices.html)\n",
"* [Dask pipeline example: Tracking the International Space Station with Dask](../clean-prep/dask-pipeline.ipynb)\n",
"
"
]
},
{
"cell_type": "markdown",
"id": "3f4a6539",
"metadata": {},
"source": [
"## `concurrent.futures`\n",
"\n",
"… extends Python’s `concurrent.futures` interface and enables the submission of user-defined tasks.\n",
"\n",
"\n",
"\n",
"**Note:**\n",
"\n",
"For the following example, Dask must be installed with the `distributed` option, for example\n",
"\n",
"``` bash\n",
"$ uv add dask[distributed]\n",
"```\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "5b173a8f-4506-435b-af7b-a8dfa5cf5548",
"metadata": {},
"outputs": [],
"source": [
"from dask.distributed import Client"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c6aaa0aa-a4ac-4e73-b1ac-fd24801d4ca8",
"metadata": {},
"outputs": [],
"source": [
"client = Client()"
]
},
{
"cell_type": "markdown",
"id": "843cbaf5-b9e2-4d17-bc18-adf64f79064e",
"metadata": {},
"source": [
"This starts local workers as processes. To run the local workers as threads, you can pass `processes=False` as a parameter:\n",
"\n",
"``` python\n",
"client = Client(processes=False)\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "f2c6281c-54c1-4abc-be5b-004be390ae29",
"metadata": {},
"source": [
"Now you can execute your own tasks and chain dependencies using the `submit` method:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "57c1fe57-7bbd-471a-93c4-c13b06dd9614",
"metadata": {},
"outputs": [],
"source": [
"from math import pi\n",
"\n",
"\n",
"def inc(x):\n",
" return x + 1\n",
"\n",
"def circumference(x):\n",
" return 2 * pi * x\n",
"\n",
"increments = client.submit(inc, 10)\n",
"circumferences = client.submit(circumference, increments)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "df7d8dee-9b39-4260-82aa-475f8aaec00f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"69.11503837897544"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"circumferences.result()"
]
},
{
"cell_type": "markdown",
"id": "e2d4fa5e",
"metadata": {},
"source": [
"\n",
"\n",
"**See also:**\n",
"\n",
"* [Dask Futures Docs](https://docs.dask.org/en/latest/futures.html)\n",
"* [Dask Futures Quickstart](https://distributed.dask.org/en/latest/quickstart.html)\n",
"* [Dask Futures Examples](https://examples.dask.org/futures.html)\n",
"
"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.13 Kernel",
"language": "python",
"name": "python313"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.0"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}