SKU/Item: AMZ-B0G2FY1FDW

DEEPSPEED IN PRODUCTION: INFERENCE OPTIMIZATION AND MODEL: Deploy LLMs efficiently with optimized serving, quantization, and low-latency inference for real-time applications

Format: Paperback, Hardcover, Kindle

Product details
Availability: In stock
Packaged weight: 0.69 kg
Returns:
Condition: New
Product from: Amazon
Ships from: USA

About this product
Run large language models with predictable latency, controlled cost, and production reliability.

Shipping LLMs is an operational problem. Teams struggle with time to first token, tokens per second, GPU memory pressure, and a moving target of engines and datatypes. This book turns those issues into clear practices you can apply with DeepSpeed and the serving layers you already use.

You get a practical path from checkpoint to stable API, with configuration that fits real workloads, not toy demos. Every topic is grounded in measurable outcomes so your stack meets SLOs under mixed traffic and budget constraints.

  • Place DeepSpeed correctly in your stack and configure kernel injection, tensor parallelism, and ZeRO for real services
  • Understand TTFT and throughput from prefill to decode, and set metrics for p95 latency and queue time
  • Size and control the KV cache with paged attention, batching, and safe headroom targets
  • Apply quantization that holds up under load, including W8A8, AWQ, GPTQ, FP8, and FP4
  • Use speculative decoding with a sound drafter choice, acceptance math, and stable fallbacks
  • Operate vLLM, TensorRT-LLM on Triton, and TGI with clean API surfaces and core flags
  • Scale with Ray Serve and plan capacity from workload shapes and arrival patterns
  • Tune for NVIDIA Hopper and Blackwell or AMD MI300X, with attention backends and NVLink planning
  • Run on Kubernetes with GPU Operator, device plugin, MIG, and topology-aware placement
  • Wire observability with Prometheus, DCGM, and OpenTelemetry spans, plus vllm bench, trtllm-bench, and GenAI-Perf
  • Ship safely with quotas, redaction, audit logs, go-live gates, and instant rollback plans

This is a code-heavy guide with working YAML, JSON, Shell, and Python examples that map directly to production, from gateway limits and network policies to rollout templates and exportable benchmark scripts.

Grab your copy today and build an LLM service that stays fast, measurable, and dependable.
U$S 76,89
55% OFF
U$S 34,95

EASY IMPORT

When you buy this product, you can deduct VAT using your RUT number.

DOES NOT USE YOUR ALLOWANCE

If your cart contains only books or CDs, it does not count against your duty-free allowance, and you can buy up to U$S 1000 per year.

Arrives in 5 to 11 business days
with shipping
Delivery is guaranteed