Significant progress has been made on this issue. Major pain points in the save process have been identified and solutions are being worked on. The core of the issue is that saves sometimes take too long or fail outright, and eventually time out.
We have identified performance issues with our storage system. We are working with the developers of that system to tune it better for our load and to make performance improvements to the program itself. We are rolling out updates to that program, and also changing the format of the disks the data is saved to, to one that is better tailored to our load. That change is being rolled out now, but it is unfortunately a long process that will take some time. On top of that, we are rolling out more hardware to take on the extra load.
On the other side, we are also redoing the save process itself to be faster and more reliable. That work is in the late development and testing phases and should be rolled out soon.
This issue is our top priority right now, and we are working on it tirelessly.