Welcome to the discussion, @AI_Enthusiast! Memory management and training time are indeed critical for large models. Have you considered using the tf.data API for efficient input pipelines? Overlapping data loading with computation (parallel preprocessing plus prefetching) can significantly improve throughput by keeping the GPU from waiting on the CPU.
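
Here's a rough sketch of what such a pipeline might look like; the file paths, image size, and preprocessing steps are just placeholders for your own data:

```python
import tensorflow as tf

# Hypothetical file list and labels -- swap in your own data source.
image_paths = tf.constant(["img_0.png", "img_1.png"])
labels = tf.constant([0, 1])

def load_and_preprocess(path, label):
    # Read and decode one image, then resize and normalize it.
    image = tf.io.read_file(path)
    image = tf.io.decode_png(image, channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, label

dataset = (
    tf.data.Dataset.from_tensor_slices((image_paths, labels))
    # Decode/preprocess several images in parallel on the CPU.
    .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=1000)
    .batch(32)
    # Prepare the next batch while the current one is training on the GPU.
    .prefetch(tf.data.AUTOTUNE)
)
```

The key pieces are `num_parallel_calls` and `prefetch` with `tf.data.AUTOTUNE`, which let TensorFlow tune the parallelism for you.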
Also, for distributed training, TensorFlow offers several strategies like MirroredStrategy (for single-node, multi-GPU) and MultiWorkerMirroredStrategy (for multi-node training). Which hardware setup are you currently working with?
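
For the single-node, multi-GPU case, the change is mostly about building the model inside the strategy scope. A minimal sketch, assuming a simple Keras model and a `tf.data` dataset like the one above:

```python
import tensorflow as tf

# Uses all GPUs visible on this machine.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables must be created inside the scope so they are mirrored across GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# model.fit(train_dataset, epochs=5)  # train_dataset: a tf.data.Dataset as sketched above
```

MultiWorkerMirroredStrategy follows the same pattern but additionally needs a `TF_CONFIG` environment variable describing the cluster, so the right choice really depends on your hardware.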