In Turing (and Volta), each SM subcore can execute 1 32-wide bundle of instructions per clock. However, it can only complete 16 FP32 operations per clock. One might wonder why NVIDIA would hobble their throughput like this. The answer is the ability to also execute 16 INT32 operations per clock, which was likely expected to prevent the rest of the SM (with 32-wide datapaths) from sitting idle. Evidently, this was not borne out.
Therefore, Ampere has moved back to executing 32 FP32 operations per SM subcore per clock. What NVIDIA has done is more accurately described as “undoing the 1/2 FP32 rate”.
As for performance, each Ampere “CUDA core” will be substantially less performant than in Turing because the other structures in the SM have not been doubled.