
Highlights

  • Last week, Google released Gemma, a new family of state-of-the-art open LLMs. Gemma comes in two sizes: a 7B parameter version for efficient deployment and development on consumer-size GPUs and TPUs, and a 2B version for CPU and on-device applications. Both come in base and instruction-tuned variants.
  • After the first week, it seemed that Gemma is not very friendly to fine-tune using the ChatML format, which has been adopted by the open source community, e.g. for OpenHermes or Dolphin. I created this blog post to show you how to fine-tune Gemma using ChatML and Hugging Face TRL; see the first sketch after this list.
  • If you are using a GPU with Ampere architecture (e.g. NVIDIA A10G or RTX 4090/3090) or newer, you can use Flash Attention, a method that reorders the attention computation and leverages classical techniques (tiling, recomputation) to significantly speed it up and reduce memory usage from quadratic to linear in sequence length. TL;DR: it accelerates training by up to 3x. Learn more at FlashAttention. See the second sketch after this list.
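
A minimal sketch of the ChatML setup using TRL's `setup_chat_format` helper, which adds the ChatML special tokens, installs the ChatML chat template, and resizes the model embeddings; the checkpoint and dtype below are illustrative assumptions, not taken from the highlights:

```python
# Sketch: preparing Gemma for ChatML fine-tuning with TRL's setup_chat_format.
# The model ID and dtype are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import setup_chat_format

model_id = "google/gemma-7b"  # assumed base checkpoint

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Adds the ChatML tokens (<|im_start|>, <|im_end|>), sets the ChatML chat
# template on the tokenizer, and resizes the model embeddings to match.
model, tokenizer = setup_chat_format(model, tokenizer)

# Render a conversation in ChatML to verify the template.
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]
print(tokenizer.apply_chat_template(messages, tokenize=False))
```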
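
And a minimal sketch of enabling Flash Attention 2 when loading the model with transformers, assuming the `flash-attn` package is installed and the GPU is Ampere or newer; the checkpoint is again an illustrative assumption:

```python
# Sketch: loading Gemma with Flash Attention 2 via transformers.
# Requires `pip install flash-attn` and an Ampere-or-newer GPU; the model ID
# is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",
    torch_dtype=torch.bfloat16,               # FA2 requires fp16/bf16 weights
    attn_implementation="flash_attention_2",  # swap in the FlashAttention kernels
    device_map="auto",
)
```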