
Metadata

Highlights

  • Fully Sharded Data Parallelism (FSDP) is a paradigm in which the optimizer states, gradients, and parameters are sharded across devices. During the forward pass, each FSDP unit performs an all-gather operation to obtain the complete weights, runs its computation, and then discards the shards that came from other devices. After the forward pass, the loss is computed, followed by the backward pass. In the backward pass, each FSDP unit again all-gathers the complete weights and computes its local gradients. These local gradients are averaged and sharded across the devices via a reduce-scatter operation, so that each device can update the parameters of its own shard; a minimal usage sketch follows below. For more information on what PyTorch FSDP is, please refer to this blog post: Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel. (View Highlight)
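The sketch below illustrates the workflow the highlight describes using PyTorch's `FullyShardedDataParallel` wrapper. The toy model, synthetic data, learning rate, and training loop are illustrative assumptions, not taken from the post; only the FSDP wrapping pattern and the all-gather / reduce-scatter behavior noted in the comments reflect the highlighted description.

```python
# Minimal FSDP training sketch (assumed setup; launch with
# `torchrun --nproc_per_node=<num_gpus> fsdp_sketch.py`).
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model standing in for a large network (hypothetical example).
    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
    ).cuda()

    # Wrapping with FSDP shards parameters, gradients, and optimizer state
    # across ranks. Each FSDP unit all-gathers its full weights just in time
    # for the forward/backward computation and frees the other ranks' shards
    # afterwards.
    model = FSDP(model)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 1024, device="cuda")  # synthetic batch
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()   # local gradients are averaged and sharded via reduce-scatter
        optimizer.step()  # each rank updates only the parameters of its own shard

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Note that `FSDP(model)` with default arguments shards at the top level; in practice an auto-wrap policy is usually passed so that individual submodules become separate FSDP units, keeping the peak memory of each all-gather small.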