Dask DataFrames: An Introduction
#############################
Video Source: www.youtube.com/watch?v=AT2XtFehFSQ
In this video, Matt Rocklin gives a brief introduction to Dask DataFrames.
• Dask is a free and open-source library for parallel computing in Python. Dask is a community project maintained by developers and organizations.
• Dask helps you scale your data science and machine learning workflows and makes it easy to work with NumPy, pandas, and scikit-learn. Dask is a framework for building distributed applications and has been used with dozens of other systems such as XGBoost, PyTorch, Prefect, Airflow, RAPIDS, and more.
• Dask DataFrames scale pandas workflows, enabling applications in time series, business intelligence, and general data munging on big data. A Dask DataFrame is a large parallel DataFrame composed of many smaller pandas DataFrames, split along the index. These pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask DataFrame operation triggers many operations on the constituent pandas DataFrames.
• Dask DataFrames coordinate many pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These pandas objects may live on disk or on other machines. Because the dask.dataframe application programming interface (API) is a subset of the pandas API, it should be familiar to pandas users.
• Share your feedback with us in the comments and let us know:
• Did you find the video helpful?
• Have you used Dask before?
• Learn more at https://docs.dask.org/en/latest/dataf...
• KEY MOMENTS
• 00:00 - Intro
• 00:15 - Start with Pandas
• 01:22 - Dask DataFrames
• 02:26 - Multiple files
• 03:14 - Dask DataFrame Partitions
• 04:33 - Mapping a Function Across All Partitions
• 06:35 - Metadata
• 06:46 - Parquet
#############################