A complete workflow for distributed fine-tuning of very large language models on consumer-grade multi-GPU setups. It combines three memory-saving techniques: FSDP shards model state across GPUs, 4-bit ...
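The 4-bit quantization mentioned above can be illustrated with a minimal, self-contained sketch. This is not the project's actual implementation (libraries such as bitsandbytes use more elaborate schemes like NF4); it is a simple symmetric blockwise absmax quantizer in pure NumPy, with the block size and the [-7, 7] code range chosen here purely for illustration:

```python
import numpy as np

def quantize_4bit(x, block_size=64):
    """Blockwise absmax quantization: each block of `block_size` values
    is scaled by its largest magnitude and rounded to an integer code
    in [-7, 7] (a symmetric 4-bit range)."""
    x = x.reshape(-1, block_size)
    scales = np.abs(x).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0              # avoid division by zero
    q = np.round(x / scales * 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q, scales):
    """Invert the quantization: map codes back to floats via the scales."""
    return (q.astype(np.float32) / 7) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s).reshape(-1)
err = np.abs(w - w_hat).max()              # bounded by (block scale) / 14
```

Storing the int8 codes packed two-per-byte plus one scale per block is what yields the roughly 4x memory reduction over fp16 weights that makes this technique attractive on consumer GPUs.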
I can corroborate the finding of @zch0414 below that there is no way to configure the trainer to force a sync when using FSDP. As explained here, this is a big problem for memory-intensive workloads. I ...