In June we saw an update to the NVMe standard. The update defines a software interface for reading from and writing to drives in a way that matches how SSDs and NAND flash actually work.

Instead of emulating the traditional block device model that SSDs inherited from hard drives and earlier storage technologies, the new NVMe Zoned Namespaces optional feature allows SSDs to implement a different storage abstraction over flash memory. This is quite similar to the extensions SAS and SATA have added to accommodate Shingled Magnetic Recording (SMR) hard drives, with a few extras for SSDs. ‘Zoned’ SSDs with this new feature can offer better performance than regular SSDs, with less overprovisioning and less DRAM. The downside is that applications and operating systems have to be updated to support zoned storage, but that work is well underway.

The NVMe Zoned Namespaces (ZNS) specification has been ratified and published as a Technical Proposal. It builds on top of the current NVMe 1.4a specification, in preparation for NVMe 2.0. The upcoming NVMe 2.0 specification will incorporate all the approved Technical Proposals, but also reorganize that same functionality into multiple smaller component documents: a base specification, plus one for each command set (block, zoned, key-value, and potentially more in the future), and separate specifications for each transport protocol (PCIe, RDMA, TCP). The standardization of Zoned Namespaces clears the way for broader commercialization and adoption of this technology, which so far has been held back by vendor-specific zoned storage interfaces and very limited hardware choices.

Zoned Storage: An Overview

The fundamental challenge of using flash memory in a solid state drive is that our computers and software are built around assumptions about how hard drives work, and flash memory doesn't behave like a hard drive. Flash is organized very differently, but its performance is compelling enough that optimizing our systems for how it actually works is worth the trouble.

Magnetic platters are a fairly analog storage medium, with no inherent structure to dictate features like sector sizes. The long-lived standard of 512-byte sectors was chosen merely for convenience, and enterprise drives now support 4kB sectors as we reach drive capacities in the multi-TB range. By contrast, a flash memory chip has several levels of structure baked into the design. The most important numbers are the page size and the erase block size. Data can be read with page-size granularity (typically on the order of several kB) and an empty page can be written with a program operation, but erase operations clear an entire multi-MB block. The substantial size mismatch between read/program operations and erase operations is a complication that ordinary mechanical hard drives don't have to deal with. The limited program/erase cycle endurance of flash memory adds to the challenge, since avoiding unnecessary writes extends the drive's lifespan.
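To make that size mismatch concrete, here's a minimal sketch in C. The geometry numbers (4kB sectors, 16kB pages, 256 pages per erase block) are illustrative assumptions rather than any particular drive's specs; real values vary by vendor and flash generation.

```c
#include <stdio.h>

/* Illustrative NAND geometry -- real values vary by vendor and flash generation. */
#define SECTOR_SIZE      (4u * 1024)                     /* logical block exposed to the host */
#define PAGE_SIZE        (16u * 1024)                    /* smallest unit that can be read or programmed */
#define PAGES_PER_BLOCK  256u
#define BLOCK_SIZE       (PAGE_SIZE * PAGES_PER_BLOCK)   /* smallest unit that can be erased: 4 MiB here */

int main(void)
{
    /* Updating one 4kB sector "in place" would mean touching the whole erase block:
     * read every still-valid page, erase the block, then program everything back. */
    unsigned worst_case_amplification = BLOCK_SIZE / SECTOR_SIZE;

    printf("read/program granularity : %u kB (one page)\n", PAGE_SIZE / 1024);
    printf("erase granularity        : %u kB (one block)\n", BLOCK_SIZE / 1024);
    printf("worst-case write amplification for a 4kB in-place update: %ux\n",
           worst_case_amplification);
    return 0;
}
```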

Almost all SSDs today are presented to software as an abstraction of a simple HDD-like block storage device with 512-byte or 4kB sectors. This hides all the complexities of SSDs that we've covered in detail over the years, such as page and erase block sizes, wear leveling and garbage collection. This abstraction is also part of why SSD controllers and firmware are so much bigger and more complicated (and more bug-prone) than hard drive controllers. For most purposes, the block device abstraction is still the right compromise, because it allows unmodified software to enjoy most of the performance benefits of flash memory, and the downsides like write amplification are manageable.
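For a rough picture of what the drive does behind that abstraction, below is a hedged sketch of the central FTL data structure: a logical-to-physical map that redirects every overwrite to a fresh physical page and leaves the old copy behind as garbage. All of the names and sizes here are invented for illustration; a real FTL also has to handle garbage collection, wear leveling, bad blocks, and power-loss protection, and its map grows roughly in proportion to drive capacity, which is where the DRAM requirement comes from.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_LOGICAL_BLOCKS  1024u   /* illustrative, tiny drive */
#define NUM_PHYSICAL_PAGES  1536u   /* extra pages = overprovisioning */
#define INVALID             UINT32_MAX

/* Logical-to-physical map: the table a conventional SSD keeps (largely) in DRAM. */
static uint32_t l2p[NUM_LOGICAL_BLOCKS];
static uint8_t  page_state[NUM_PHYSICAL_PAGES];  /* 0 = free, 1 = valid, 2 = stale */
static uint32_t next_free_page = 0;

static void ftl_init(void)
{
    memset(page_state, 0, sizeof(page_state));
    for (uint32_t i = 0; i < NUM_LOGICAL_BLOCKS; i++)
        l2p[i] = INVALID;
}

/* Every write goes out of place: pick a fresh page, update the map,
 * and mark the previously mapped page stale for later garbage collection. */
static int ftl_write(uint32_t lba)
{
    if (next_free_page >= NUM_PHYSICAL_PAGES)
        return -1;                       /* a real FTL would trigger garbage collection here */
    if (l2p[lba] != INVALID)
        page_state[l2p[lba]] = 2;        /* the old copy becomes garbage */
    l2p[lba] = next_free_page;
    page_state[next_free_page++] = 1;
    return 0;
}

static uint32_t ftl_read(uint32_t lba)
{
    return l2p[lba];                     /* physical page holding the current data */
}

int main(void)
{
    ftl_init();
    ftl_write(42);                       /* first write of LBA 42 lands on page 0 */
    ftl_write(42);                       /* overwrite goes to page 1; page 0 is now stale */
    printf("LBA 42 -> physical page %u\n", (unsigned)ftl_read(42));
    printf("page 0 state: %u (2 = stale, awaiting garbage collection)\n",
           (unsigned)page_state[0]);
    return 0;
}
```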

For years, the storage industry has been exploring alternatives to the block storage abstraction. There have been several proposals for Open Channel SSDs, which expose many of the gory details of flash memory directly to the host system, moving many of the responsibilities of SSD firmware over to software running on the host CPU. The various open channel SSD standards that have been promoted have struck different balances along the spectrum, between a typical SSD with a fully drive-managed flash translation layer (FTL) to a fully software-managed solution. The industry consensus was that some of the earliest standards, like the LightNVM 1.x specification, exposed too many details, requiring software to handle some differences between different vendors' flash memory, or between SLC, MLC, TLC, etc. Newer standards have sought to find a better balance and a level of abstraction that will allow for easier mass adoption while still allowing software to bypass the inefficiencies of a typical SSD.

Tackling the problem from the other direction, the NVMe standard has been gaining features that allow drives to share more information with the host about optimal patterns for data access and layout. For the most part these are hints and optional features, so software that isn't aware of them will still function as usual. Directives and Streams, NVM Sets, Predictable Latency Mode, and various alignment and granularity hints have all been added over the past few revisions of the NVMe specification to make it possible for software and SSDs to better cooperate.
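As one hedged illustration of what such a hint looks like at the command level, the sketch below uses the Linux NVMe passthrough ioctl to issue a Write command tagged with a Streams directive (the directive type in CDW12, the stream identifier in CDW13). The device path, namespace ID, stream number, and 512-byte LBA format are all assumptions made for the example, and a drive would first need streams enabled via a Directive Send command; treat this as a field-level sketch, not production code.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    /* Placeholder device node and namespace ID; adjust for your system.
     * WARNING: this writes to LBA 0 of the namespace -- illustration only. */
    int fd = open("/dev/nvme0n1", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* One logical block of dummy data (assumes a 512-byte LBA format). */
    static char buf[512];
    memset(buf, 0xAB, sizeof(buf));

    struct nvme_passthru_cmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = 0x01;                 /* NVMe Write */
    cmd.nsid     = 1;
    cmd.addr     = (uintptr_t)buf;
    cmd.data_len = sizeof(buf);
    cmd.cdw10    = 0;                    /* starting LBA, low 32 bits */
    cmd.cdw11    = 0;                    /* starting LBA, high 32 bits */
    cmd.cdw12    = (1u << 20) | 0u;      /* DTYPE=1 (Streams), NLB=0 -> one block (0-based) */
    cmd.cdw13    = 7u << 16;             /* DSPEC: stream identifier 7 (arbitrary choice) */

    if (ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) < 0)
        perror("NVME_IOCTL_IO_CMD");
    else
        printf("write tagged with stream 7 completed\n");
    return 0;
}
```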

Lately, a third approach has been gaining momentum, influenced by the hard drive market. Shingled Magnetic Recording (SMR) is a technique for increasing storage density by partially overlapping tracks on hard drive platters. The downside of this approach is that it's no longer possible to directly modify arbitrary bytes of data without corrupting adjacent overlapping tracks, so SMR hard drives group tracks into zones and only allow sequential writes within a zone. This has severe performance implications for workloads that include random writes, which is part of why drive-managed SMR hard drives have seen a mixed reception at best in the marketplace. However, in the server storage market, host-managed SMR is also a viable option: it requires the OS, filesystem and potentially the application software to be directly aware of zones, but making the necessary software changes is not an insurmountable challenge when working with a controlled environment.
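To see how the sequential-write rule plays out in practice, here's a minimal sketch of the host-managed zone model: each zone tracks a write pointer, writes are accepted only at that pointer, and the only way to reclaim space is to reset the whole zone. The structure and zone size below are illustrative assumptions, not any particular drive's layout.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ZONE_SIZE_LBAS  65536u  /* illustrative zone capacity in logical blocks */

struct zone {
    uint64_t start_lba;      /* first LBA of the zone */
    uint64_t write_pointer;  /* next LBA that may be written */
    bool     full;
};

/* Writes must land exactly at the write pointer; random writes inside
 * the zone are rejected, which is the core contract of zoned storage. */
static bool zone_write(struct zone *z, uint64_t lba, uint32_t nlbas)
{
    if (z->full || lba != z->write_pointer)
        return false;                                        /* out-of-order write: error */
    if (z->write_pointer + nlbas > z->start_lba + ZONE_SIZE_LBAS)
        return false;                                        /* would cross the zone boundary */
    z->write_pointer += nlbas;
    if (z->write_pointer == z->start_lba + ZONE_SIZE_LBAS)
        z->full = true;
    return true;
}

/* The only way to reclaim space: reset the whole zone
 * (analogous to erasing a flash block or rewriting an SMR band). */
static void zone_reset(struct zone *z)
{
    z->write_pointer = z->start_lba;
    z->full = false;
}

int main(void)
{
    struct zone z = { .start_lba = 0, .write_pointer = 0, .full = false };
    printf("sequential write at the write pointer: %s\n", zone_write(&z, 0, 8) ? "ok" : "rejected");
    printf("random write at LBA 4096:              %s\n", zone_write(&z, 4096, 8) ? "ok" : "rejected");
    zone_reset(&z);  /* reclaim the zone; write pointer back to the start */
    return 0;
}
```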

The zoned storage model used for SMR hard drives turns out to also be a good fit for flash, and is a precursor to NVMe Zoned Namespaces. The zone structure of SMR hard drives mirrors the page and erase block structure of an SSD. The restrictions on writes aren't an exact match, but they come close enough.

In this article, we’ll cover what NVMe Zoned Namespaces are, and why they matter.

Comments

  • WorBlux - Wednesday, December 22, 2021

    > Kinda seems like they introduced a whole new problem, there?

    Sort of. Many of these drives are meant to be used with the APIs directly accessible to your application, which means the application now has to solve problems that the OS used to hand-wave away.

    If your application uses the filesystem API, only the filesystem has to worry about this. But if you want the application to leverage the determinism and parallelism available in ZNS drives, then it should be able to use the zone append command (which is the big advance over the ZBC API of SMR hard drives).
  • Steven Wells - Saturday, August 8, 2020

    So as a DRAM cost play this might save low single-digit percentages of the parts cost, which seems like not enough motivation on its own. So clearly most of the savings comes from the reduced overprovisioning of the flash needed to get similar write amplification, trading that against the extra lift required by the host. Curious if anyone has shared TCO studies on this to validate a clear cost savings for all the heavy lifting required by both drive and data center customer.
  • matias.bjorling - Monday, August 10, 2020

    Thanks for the comprehensive write-up, Billy. It's great to see you writing about ZNS on AnandTech - I've followed it for 20 years! I never thought that my work on creating Open-Channel SSDs and Zoned Namespaces would one day be featured on AnandTech. Thanks for making it mainstream!
  • umeshm - Monday, August 24, 2020

    This is the best explanation and analysis I have found on ZNS. Thank you, Billy!

    I have a question about how 4KB LBAs are persisted when the flash page size is 16KB and when 4 pages (QLC) are stored on each wordline.

    You mentioned that a 4KB LBA is partially programmed into a flash page, but is vulnerable to read disturb. My understanding so far is that only SLC supports partial-page programming. So a 4KB LBA would need to be buffered in NVRAM within the SSD until there is a full page (16KB) worth of data to write to a page. Even then, the wordline is only partially programmed, because the 3 other pages have not been programmed yet, so the wordline still needs to be protected through additional caching or buffering in NVRAM.

    Could you or someone else please either confirm or correct my understanding?

    Again, I really appreciate the effort and thinking that has gone into this article.

    Umesh
  • weilinzhu - Monday, June 20, 2022

    Very helpful article, thanks a lot! As you wrote: "A recent Western Digital demo with a 512GB ZNS prototype SSD showed the drive using a zone size of 256MB (for 2047 zones total) but also supporting 2GB zones." Could you please kindly tell me where that demo was published? Thanks in advance!
