Skip to content

Files

Latest commit

Apr 7, 2020
4e8e8e1 · Apr 7, 2020

History

History
126 lines (93 loc) · 6.12 KB

0277-float16.md

File metadata and controls

126 lines (93 loc) · 6.12 KB

Float16

Introduction

Introduce the Float16 type conforming to the BinaryFloatingPoint and SIMDScalar protocols, binding the IEEE 754 binary16 format (aka float16, half-precision, or half), and bridged by the compiler to the C _Float16 type.

Motivation

The last decade has seen a dramatic increase in the use of floating-point types smaller than (32-bit) Float. The most widely implemented is Float16, which is used extensively on mobile GPUs for computation, as a pixel format for HDR images, and as a compressed format for weights in ML applications.

Introducing the type to Swift is especially important for interoperability with shader-language programs; users frequently need to set up data structures on the CPU to pass to their GPU programs. Without the type available in Swift, they are forced to use unsafe mechanisms to create these structures.

In addition, C APIs that use these types simply cannot be imported, making them unavailable in Swift.

Proposed solution

Add Float16 to the standard library.

Detailed design

There is shockingly little to say here. We will add:

@frozen
struct Float16: BinaryFloatingPoint, SIMDScalar, CustomStringConvertible { }

The entire API falls out from that, with no additional surface outside that provided by those protocols. Float16 will provide exactly the operations that Float and Double and Float80 do for their conformance to these protocols.

We also need to ensure that the parameter passing conventions followed by the compiler for Float16 are what we want; these values should be passed and returned in the floating-point registers, and vectors should be passed and returned in SIMD registers.

On platforms that do not have native arithmetic support, we will convert Float16 to Float and use the hardware support for Float to perform the operation. This is correctly-rounded for every operation except fused multiply-add. A software sequence will be used to emulate fused multiply-add in these cases (the easiest option is to convert to Double, but other options may be more efficient on some architectures, especially for vectors).

Source compatibility

N/A

Effect on ABI stability

There is no change to existing ABI. We will be introducing a new type, which will have appropriate availability annotations.

Effect on API resilience

The Float16 type would become part of the public API. It will be @frozen, so no further changes will be possible, but its API and layout are entirely constrained by IEEE 754 and conformance to BinaryFloatingPoint, so there are no alternatives possible anyway.

Alternatives considered

Q: Why isn't it called Half?

A: FloatN is the more consistent pattern. Swift already has Float32, Float64 and Float80 (with Float and Double as alternative spellings of Float32 and Float64, respectively). At some future point we will add Float128. Plus, the C language type that this will bridge to is named _Float16.

Half is not completely outrageous as an alias, but we shouldn't add aliases unless there's a really compelling reason.

During the pitch phase, feedback was fairly overwhelmingly in favor of Float16, though there are a few people who would like to have both names available. Unless significantly more people come forward, however, we should make the "opinionated" choice to have a single name for the type. An alias could always be added with a subsequent minor proposal if necessary.

Q: What about ARM's "alternative half precision"? What about bfloat16?

A: Alternative half-precision is no longer supported; ARMv8.2-FP16 only uses the IEEE 754 encoding. Bfloat is something that we should also support eventually, but it's a separate proposal. and we should wait a little while before doing it. Conformance to IEEE 754 fully defines the semantics of Float16 (and several hardware vendors, including Apple have been shipping processors that implement those semantics for a few years). By contrast, there are a few proposals for hardware that implements bfloat16; ARM and Intel designed different multiply-add units for it, and haven't shipped yet (and haven't defined any operations other than a non-homogeneous multiply-add). Other companies have HW in use, but haven't (publicly) formally specified their arithmetic. It's a moving target, and it would be a mistake to attempt to specify language bindings today.

Q: Do we need conformance to BinaryFloatingPoint? What if we made it a storage-only format?

A: We could make it a type that can only be used to convert to/from Float and Double, forcing all arithmetic to be performed in another format. However, this would mean that it would be much harder, in some cases, to get the same numerical result on the CPU and GPU (when GPU computation is performed in half-precision). It's a very nice convenience to be able to do a computation in the REPL and see what a GPU is going to do.

Q: Why not put it in Swift Numerics?

A: The biggest reason to add the type is for interoperability with C-family and GPU programs that want to use their analogous types. In order to maximize the support for that interoperability, and to get the calling conventions that we want to have in the long-run, it makes more sense to put this type in the standard library.

Q: What about math library support?

A: If this proposal is approved, I will add conformance to Real in Swift Numerics, providing the math library functions (by using the corresponding Float implementations initially).