- Proposal: SE-0277
- Author: Stephen Canon
- Review Manager: Ben Cohen
- Status: Implemented (Swift 5.3)
- Implementation: apple/swift#30130
- Decision Notes: Rationale
Introduce the Float16 type conforming to the BinaryFloatingPoint and SIMDScalar
protocols, binding the IEEE 754 binary16 format (aka float16, half-precision, or half),
and bridged by the compiler to the C _Float16 type.
- Old pitch thread: Add
Float16. - New pitch thread: Add Float16
The last decade has seen a dramatic increase in the use of floating-point types smaller
than (32-bit) Float. The most widely implemented is Float16, which is used
extensively on mobile GPUs for computation, as a pixel format for HDR images, and as
a compressed format for weights in ML applications.
Introducing the type to Swift is especially important for interoperability with shader-language programs; users frequently need to set up data structures on the CPU to pass to their GPU programs. Without the type available in Swift, they are forced to use unsafe mechanisms to create these structures.
In addition, C APIs that use these types simply cannot be imported, making them unavailable in Swift.
Add Float16 to the standard library.
There is shockingly little to say here. We will add:
@frozen
struct Float16: BinaryFloatingPoint, SIMDScalar, CustomStringConvertible { }
The entire API falls out from that, with no additional surface outside that provided by those
protocols. Float16 will provide exactly the operations that Float and Double and Float80
do for their conformance to these protocols.
We also need to ensure that the parameter passing conventions followed by the compiler
for Float16 are what we want; these values should be passed and returned in the
floating-point registers, and vectors should be passed and returned in SIMD registers.
On platforms that do not have native arithmetic support, we will convert Float16 to
Float and use the hardware support for Float to perform the operation. This is
correctly-rounded for every operation except fused multiply-add. A software sequence
will be used to emulate fused multiply-add in these cases (the easiest option is to convert
to Double, but other options may be more efficient on some architectures, especially
for vectors).
N/A
There is no change to existing ABI. We will be introducing a new type, which will have appropriate availability annotations.
The Float16 type would become part of the public API. It will be @frozen, so no
further changes will be possible, but its API and layout are entirely constrained by
IEEE 754 and conformance to BinaryFloatingPoint, so there are no alternatives
possible anyway.
Q: Why isn't it called Half?
A: FloatN is the more consistent pattern. Swift already has Float32,
Float64 and Float80 (with Float and Double as alternative spellings of Float32
and Float64, respectively). At some future point we will add Float128. Plus, the C
language type that this will bridge to is named _Float16.
Half is not completely outrageous as an alias, but we shouldn't add aliases unless
there's a really compelling reason.
During the pitch phase, feedback was fairly overwhelmingly in favor of Float16, though
there are a few people who would like to have both names available. Unless significantly
more people come forward, however, we should make the "opinionated" choice to have a single
name for the type. An alias could always be added with a subsequent minor proposal if
necessary.
Q: What about ARM's "alternative half precision"? What about bfloat16?
A: Alternative half-precision is no longer supported; ARMv8.2-FP16 only uses the IEEE 754
encoding. Bfloat is something that we should also support eventually, but it's a separate
proposal. and we should wait a little while before doing it. Conformance to IEEE 754 fully
defines the semantics of Float16 (and several hardware vendors, including Apple have
been shipping processors that implement those semantics for a few years). By contrast,
there are a few proposals for hardware that implements bfloat16; ARM and Intel designed
different multiply-add units for it, and haven't shipped yet (and haven't defined any
operations other than a non-homogeneous multiply-add). Other companies have HW in
use, but haven't (publicly) formally specified their arithmetic. It's a moving target, and it
would be a mistake to attempt to specify language bindings today.
Q: Do we need conformance to BinaryFloatingPoint? What if we made it a storage-only format?
A: We could make it a type that can only be used to convert to/from Float and Double,
forcing all arithmetic to be performed in another format. However, this would mean that
it would be much harder, in some cases, to get the same numerical result on the CPU and
GPU (when GPU computation is performed in half-precision). It's a very nice convenience
to be able to do a computation in the REPL and see what a GPU is going to do.
Q: Why not put it in Swift Numerics?
A: The biggest reason to add the type is for interoperability with C-family and GPU programs that want to use their analogous types. In order to maximize the support for that interoperability, and to get the calling conventions that we want to have in the long-run, it makes more sense to put this type in the standard library.
Q: What about math library support?
A: If this proposal is approved, I will add conformance to Real in Swift Numerics, providing
the math library functions (by using the corresponding Float implementations initially).