DUCKDB FOR DATA SCIENTISTS: PYTHON ANALYTICS WITHOUT INFRASTRUCTURE: Process Gigabytes Locally with SQL, Pandas Integration and Real Data Science Projects
Format: Hardcover
Availability: In stock
Weight: 0.77 kg
Yes
Condition: New
Seller: Amazon
Origin: USA
Run serious analytics on your laptop with DuckDB, Python, and modern file formats to process gigabytes of data without infrastructure.

Many data scientists feel stuck between slow notebooks and heavyweight data warehouses. Local CSV and pandas workflows crumble once datasets grow, but spinning up clusters, managing credentials, and waiting on shared resources is often overkill for day-to-day analytics and modeling.

This book shows how to make DuckDB the center of a fast, reliable, fully local workflow. You will use standard SQL, tight Python integration, and columnar formats like Parquet to query, transform, and model real datasets on a single machine while still integrating cleanly with tools such as pandas, Arrow, dbt, Ibis, and scikit-learn.

You will learn to:

- Decide when DuckDB is the right tool for a project, how to choose between in-memory and file-backed databases, and how to pin versions so your examples stay stable across environments.
- Use core SQL every day, including joins, aggregations, filters that push down to files, window functions for feature engineering, ASOF joins for time series (sketched below), and practical patterns for JSON columns, lists, and structs.
- Build a scalable Python workflow with the relational API (see the sketch after this list), composing pipelines lazily, handing data off to pandas and Arrow with minimal copying, and streaming large results in manageable batches instead of materializing everything at once.
- Ingest real data formats without friction, from large CSVs with tricky headers and encodings to newline-delimited JSON over HTTP, and design Parquet layouts with row-group sizing, compression, and partitioning that keep later queries fast.
- Work with remote and federated sources using httpfs and the Secrets Manager to reach S3, Azure, and GCS, query Postgres and SQLite through scanners, move hot subsets into local Parquet, and use MotherDuck as a cloud companion when collaboration is required.
- Tune execution with an engineer-focused view by understanding vector size, pipelines, and parallel operators, setting memory limits and temp directories, controlling threads, profiling with EXPLAIN and the query profiler (sketched below), and using zone maps, clustering, and secondary indexes for data pruning.
- Apply lakehouse patterns locally by writing and reading Apache Iceberg tables from DuckDB, reading Delta tables with clear expectations and limits, exporting databases to well-partitioned Parquet layouts, and handling database encryption keys and secrets in a safe, repeatable way.
- Increase productivity with extensions and tooling, including full-text search with BM25 ranking for document search (sketched below), spatial analytics for distances and areas, dbt for local builds, tests, and artifacts, Ibis for lazy data frames, and DuckDB WASM for browser-based analytics over public Parquet.
- Work through full projects, including NYC taxi demand modeling on a laptop with monthly Parquet over HTTP, and an OpenAlex scholarly-graph project that mines JSON snapshots, computes institution and author metrics, and builds JSON full-text search with ranked, snippet-based results. Then finish with production-focused chapters on tuning checklists, concurrency and single-writer rules, safe ingestion patterns, testing and CI with pinned versions and golden outputs, and practical recovery from schema drift and out-of-memory conditions.
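To give a flavor of that relational-API workflow, here is a minimal sketch (not code from the book): it queries a remote Parquet file, hands a small aggregate to pandas, and streams a larger result in chunks. The URL, column names, and filter are invented placeholders, and it assumes a DuckDB build where the httpfs extension can be installed.

```python
import duckdb

con = duckdb.connect()             # in-memory database
con.execute("INSTALL httpfs")      # extension for http(s):// and s3:// paths
con.execute("LOAD httpfs")

# Lazy relation over a remote Parquet file; the URL and columns are placeholders.
trips = con.sql(
    "SELECT pickup_ts, fare_amount "
    "FROM read_parquet('https://example.com/yellow_2024-01.parquet')"
).filter("fare_amount > 0")

# Hand a small aggregate off to pandas with minimal copying.
daily = trips.aggregate(
    "date_trunc('day', pickup_ts) AS day, count(*) AS n_trips",
    "date_trunc('day', pickup_ts)",
).df()
print(daily.head())

# Stream a large result in manageable chunks instead of materializing it all.
con.execute(
    "SELECT * FROM read_parquet('https://example.com/yellow_2024-01.parquet') "
    "WHERE fare_amount > 0"
)
while True:
    chunk = con.fetch_df_chunk()   # returns an empty DataFrame when exhausted
    if len(chunk) == 0:
        break
    # ...process each pandas chunk here...
```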
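The ASOF-join and window-function patterns mentioned above might look like the following self-contained sketch; the quotes and trades tables and their columns are made up for illustration, not taken from the book.

```python
import duckdb

con = duckdb.connect()

# Toy time-series tables standing in for real data.
con.execute("""
    CREATE TABLE quotes AS SELECT * FROM (VALUES
        ('ACME', TIMESTAMP '2024-01-01 09:30:00', 99.5),
        ('ACME', TIMESTAMP '2024-01-01 09:31:00', 100.1)
    ) AS v(symbol, ts, price)
""")
con.execute("""
    CREATE TABLE trades AS SELECT * FROM (VALUES
        ('ACME', TIMESTAMP '2024-01-01 09:30:30', 10),
        ('ACME', TIMESTAMP '2024-01-01 09:31:30', 5)
    ) AS v(symbol, ts, qty)
""")

# ASOF JOIN picks the latest quote at or before each trade;
# the window function adds a per-symbol rolling average as a feature.
features = con.sql("""
    SELECT
        t.symbol,
        t.ts,
        t.qty,
        q.price,
        avg(t.qty) OVER (
            PARTITION BY t.symbol
            ORDER BY t.ts
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) AS qty_rolling_avg
    FROM trades t
    ASOF JOIN quotes q
        ON t.symbol = q.symbol AND t.ts >= q.ts
""").df()
print(features)
```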
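On the tuning side, the knobs described above are ordinary DuckDB settings plus EXPLAIN ANALYZE; the specific values and file path below are arbitrary examples, not recommendations from the book.

```python
import duckdb

con = duckdb.connect("local.duckdb")   # file-backed database (path is an example)

# Resource settings; values are placeholders, tune them for your machine.
con.execute("SET memory_limit = '4GB'")
con.execute("SET threads TO 4")
con.execute("SET temp_directory = '/tmp/duckdb_spill'")

# EXPLAIN ANALYZE executes the query and reports per-operator timings.
for _, plan_text in con.sql(
    "EXPLAIN ANALYZE SELECT count(*) FROM range(10000000)"
).fetchall():
    print(plan_text)
```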
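Full-text search with BM25 ranking comes from DuckDB's fts extension; the tiny docs table below is a stand-in for the book's real document data, and the query text is arbitrary.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL fts")
con.execute("LOAD fts")

# A toy documents table; real projects would load JSON or Parquet instead.
con.execute("""
    CREATE TABLE docs AS SELECT * FROM (VALUES
        (1, 'DuckDB runs analytical SQL inside your Python process'),
        (2, 'Parquet is a columnar file format built for analytics')
    ) AS v(doc_id, body)
""")

# Index the body column, then rank matches with BM25.
con.execute("PRAGMA create_fts_index('docs', 'doc_id', 'body')")
hits = con.sql("""
    SELECT doc_id, score
    FROM (
        SELECT doc_id,
               fts_main_docs.match_bm25(doc_id, 'columnar analytics') AS score
        FROM docs
    )
    WHERE score IS NOT NULL
    ORDER BY score DESC
""").df()
print(hits)
```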
This is a code-heavy guide with DuckDB SQL, Python, JavaScript, and JSON examples that you can run as-is to build working pipelines, dashboards, and reproducible local analytics projects.

Grab your copy today and build fast, reliable DuckDB analytics that stay entirely under your control.