45 Comments
Luminar - Thursday, August 6, 2020 - link
So if this is software, what's stopping these vendors from making this a retroactive firmware update?
BinaryTB - Thursday, August 6, 2020 - link
$$$$$
Luminar - Thursday, August 6, 2020 - link
If only drive controllers were jailbreakable.
althaz - Friday, August 7, 2020 - link
I mean, they technically are. But it's not really worth the expense.
close - Friday, August 7, 2020 - link
Storage is the kind of segment where I'd 100% take reliability over performance. Slow storage is still storage, unreliable storage is a data shredder.
JlHADJOE - Friday, August 7, 2020 - link
ヽ༼ຈل͜ຈ༽ノ
Steven Wells - Saturday, August 8, 2020 - link
Great point @close. So many concepts of host-managed storage all come down to “but Mr Drive Vendor, this is fully reliable, right? Even if I totally screw up the host-side management?”
close - Sunday, August 9, 2020 - link
Well, nothing is 100% reliable, and you should always plan on the possibility of it failing. But "jailbreaking" the controller to run some form of custom firmware that would increase performance, almost certainly at the cost of reliability, is like shooting your storage in the foot to make it run faster. Sooner or later you'll hit the foot.

Go for something that's validated and proven if your data matters. There's no guarantee, just a better likelihood of not losing it.
dotjaz - Friday, August 7, 2020 - link
If only jailbreak QA were better than the OEM's... oh wait, there is none.
Billy Tallis - Thursday, August 6, 2020 - link
I think Radian might actually be doing that. In general though, adding a feature of this size and complexity in a firmware update is something your customers will want to put through a full QA/qualification cycle before deploying. And by the time a customer is done updating their software stack to use ZNS, it's probably time to buy (or start qualifying) new SSDs anyway.
FreckledTrout - Thursday, August 6, 2020 - link
Like most things, it's the cost. I bet the testing alone makes it prohibitive to backport this to older SSDs.
xenol - Thursday, August 6, 2020 - link
Bingo. Testing and support cost something. Though I suppose they could release it for older drives under a no-support provision.

Except, depending on who tries this, it's inevitable someone will break something and complain that they're not getting support.
DigitalFreak - Thursday, August 6, 2020 - link
Why spend the money to make a retroactive firmware, when you can just sell the user a new drive with the updated spec? If someone cares enough about this, they'll shell out the $$$ for a new drive.
IT Mamba - Monday, December 14, 2020 - link
Easier said than done.
Grizzlebee11 - Thursday, August 6, 2020 - link
I wonder how this will affect Optane performance.
Billy Tallis - Thursday, August 6, 2020 - link
Optane has no reason to adopt a zoned model, because the underlying 3D XPoint memory supports in-place modification of data.
name99 - Saturday, August 8, 2020 - link
Does it really? I know Intel made a big deal about this, but isn't the reality (not that it changes your point, but getting the technical details right):
- the minimum Optane granularity unit is a 64B line (which, admittedly, is effectively the same as DRAM, but DRAM could go smaller if necessary; could Optane?)
- the PRACTICAL Optane granularity unit (which is what I am getting at in terms of "in-place"), giving 4x the bandwidth, is 256B.
Yeah, I'm right. Looking around I found this
https://www.usenix.org/system/files/fast20-yang.pd...
which says "the 3D-XPoint physical media access granularity is 256 bytes" with everything that flows from that: need for write combining buffers, RMW if you can't write-combine, write amplification power/lifetime concerns, etc etc.
So, sure, you don't have BIG zones/pages like flash -- but it's also incorrect (both technically, and for optimal use of the technology) to suggest that it's "true" random access, as much so as DRAM.
It remains unclear to me how much of the current disappointment around Optane DIMM performance, eg
https://www.extremetech.com/computing/288855-repor...
derives from this. Certainly the Optane-targeted algorithms and new file systems I was reading say 5 years ago, when Intel was promising essentially "flash density, RAM performance" seemed very much optimized for "true" random access with no attempts at clustering larger than a cache line.
Wouldn't be the first time the Intel advertising department's lies ended up tanking a technology because of the ultimate gap between what was promised (and designed for) vs. what was delivered...
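The read-modify-write cost name99 describes can be sketched with a toy write-amplification model (the 256B figure is from the cited FAST'20 paper; the model itself is illustrative, not measured behavior of any device):

```python
# Toy model: write amplification when host writes are smaller than the
# 256B media write granularity reported for 3D XPoint.
MEDIA_UNIT = 256  # bytes per physical media write, per the FAST'20 paper

def write_amplification(host_write_size: int) -> float:
    """Bytes actually written to media divided by bytes the host asked to
    write, assuming the worst case: no write combining, so every host write
    forces a full media-unit read-modify-write."""
    media_bytes = ((host_write_size + MEDIA_UNIT - 1) // MEDIA_UNIT) * MEDIA_UNIT
    return media_bytes / host_write_size

# A 64B cache-line write forces a 256B RMW: 4x amplification.
print(write_amplification(64))   # 4.0
print(write_amplification(256))  # 1.0 -- aligned 256B writes avoid RMW
```

This is why write-combining buffers matter so much for Optane: coalescing four adjacent 64B lines into one 256B write takes the amplification from 4x back down to 1x.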
MFinn3333 - Sunday, August 9, 2020 - link
Um... Optane DIMMs have not disappointed anybody in their performance.
https://www.storagereview.com/review/supermicro-su...
https://arxiv.org/pdf/1903.05714.pdf Shows just how
brucethemoose - Thursday, August 6, 2020 - link
Optane is byte addressable like DRAM and fairly durable, isn't it? I don't think this "multi kilobyte zoned storage" approach would be any more appropriate than the spinning rust block/sector model.

Then again, running Optane over PCIe/NVMe always seemed like a waste to me.
FunBunny2 - Friday, August 7, 2020 - link
"Optane is byte addressable like DRAM and fairly durable, isn't it?"

Yes, and my first notion was that Optane would *replace* DRAM/HDD/SSD in a 'true' 64-bit-address single-level storage space. Although slower than DRAM, such an architecture would write program variables, as they change, directly to 'storage' without all that data migration. I completely forgot that current CPUs use many levels of buffers between registers and durable storage. In other words, there's really no byte-addressed update in today's machines.
Back in the 70s and early 80s, TI (and some others, I think) built machines that had no data registers on the CPU; all instructions operated in main memory, and all data was written directly to memory and then to disc. The morphing to load/store architectures with scads of buffering means that optimum use of an Optane store under such an architecture looks to be a waste of time until/unless CPU architectures write data based on the transaction scope of applications, not buffer fill.
jeremyshaw - Monday, August 10, 2020 - link
The early 70s and 80s timeframe saw CPUs and memory scaling roughly the same, year to year. After a while, memory advanced a whole lot slower, necessitating the multiple tiers of memory we have now, from L1 cache to HDD. Modern CPUs didn't become lots of SRAM with an attached ALU just because CPU designers love throwing their transistor budget into measly megabytes of cache. They became that way simply because the other tiers of memory and storage are just too slow.
WorBlux - Wednesday, December 22, 2021 - link
Modern CPUs have instructions that let you skip the cache, and then there was SPARC with streaming accelerators, where you could unleash a true vector/CUDA-style instruction directly against a massive chunk of memory.
Arbie - Thursday, August 6, 2020 - link
An excellent article; readable and interesting even to those (like me) who don't know the tech but with depth for those who do. Right on the AT target.
Arbie - Thursday, August 6, 2020 - link
And - I appreciated the "this is important" emphasis so I knew where to pay attention.
ads295 - Friday, August 7, 2020 - link
+1 all the way
batyesz - Thursday, August 6, 2020 - link
UltraRAM is the next big step in the computer market.
tygrus - Thursday, August 6, 2020 - link
The first 512-byte sectors I remember go back to the days of IBM XT compatibles, 5¼-inch floppies, 20MB HDDs, MSDOS, FAT12 & FAT16. That's well over 30 years of baggage being carried around. File systems moved to 32-bit addressing and 4KB blocks/clusters or larger (64- or 128-bit addresses and 2MB blocks/clusters are possible).

It wastes space to save small files/fragments in large blocks, but it also wastes resources to handle more locations (smaller blocks), with longer addresses taking up more space and processing.
Management becomes more complex to overcome the quirks of HW & increased capacities.
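The small-file trade-off tygrus describes is easy to quantify. A quick sketch of the "slack space" lost at the end of each file's last cluster (the numbers below are illustrative examples, not from the article):

```python
# Slack space: bytes wasted when files occupy fixed-size clusters.
def slack(file_size: int, cluster_size: int) -> int:
    """Bytes of the file's last cluster left unused."""
    if file_size == 0:
        return 0          # empty files allocate no data clusters
    return -file_size % cluster_size

# A tiny file wastes almost a whole cluster; bigger clusters waste more.
print(slack(1, 4096))       # 4095 bytes wasted with 4KB clusters
print(slack(1, 2 * 2**20))  # 2097151 bytes wasted with 2MB clusters
print(slack(6000, 4096))    # 2192 -- a 6000B file spans two 4KB clusters
```

The flip side, as the comment notes, is that smaller clusters mean more allocation units to track, so the metadata and processing overhead grows as cluster size shrinks.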
WaltC - Tuesday, August 11, 2020 - link
Years ago, just for fun, I formatted a HD with 1k clusters because I wanted to see how much of a slowdown the increased overhead would create. I was surprised at how pronounced it was and quickly jumped back to 4k clusters. That was many years ago--I can't even recall what version of Windows I was using at the time...;)
Crazyeyeskillah - Thursday, August 6, 2020 - link
I'll ask the dumb question no one else has posted:

What kind of performance numbers will this equate to?
Cheers
Billy Tallis - Thursday, August 6, 2020 - link
There are really too many variables and too little data to give a good answer at this point. Some applications will be really ill-suited to running on zoned storage, and may not gain any performance. Even for applications that are a good fit for zoned storage, the most important benefits may be to latency/QoS metrics that are less straightforward to interpret than throughput.

The Radian/IBM Research case study mentioned near the end of the article claims a 65% improvement to throughput and a 22x improvement to some tail latency metric for a Sysbench MySQL test. That's probably close to best-case numbers.
Carmen00 - Friday, August 7, 2020 - link
Fantastic article, both in-depth and accessible, a great primer for what's coming up on the horizon. This is what excellence in tech journalism looks like!
Steven Wells - Saturday, August 8, 2020 - link
Agree with @Carmen00. Super well written. Fingers crossed that one of these “not a rotating-rust emulator” architectures can get airborne. As long as the flash memory chip designers remain unconstrained to do great things to reduce cost generation to generation, with the SSD maintaining the fixed abstraction, I'm all for this.
Javier Gonzalez - Friday, August 7, 2020 - link
Great article Billy. A couple of pointers to other parts of the ecosystem that are being upstreamed at the moment:

- QEMU support for ZNS emulation (several patches posted in the mailing list)
- Extensions to fio: Currently posted and waiting for stabilizing support for append in the kernel
- nvme-cli: Several patches for ZNS management are already merged
Also, a comment on xZTL: it is intended to be used with several LSM-based databases. We ported RocksDB as a first step, but other DBs are being ported on top. xZTL provides the necessary abstractions for the DB backend to be pretty thin - you can see the RocksDB HDFS backend as an example.
Again, great article!
Billy Tallis - Friday, August 7, 2020 - link
Thanks for the feedback, and for your presentations that were a valuable source for this article!
Javier Gonzalez - Friday, August 7, 2020 - link
Happy to hear that it helped.

Feel free to reach out if you have questions on a follow-up article :)
jabber - Friday, August 7, 2020 - link
And for all that, it will still slow to Kbps and take two hours when copying a 2GB folder full of KB-sized microfiles.

We now need better, more efficient file systems, not hardware.
AntonErtl - Friday, August 7, 2020 - link
Thank you for this very interesting article.

It seems to me that ZNS strikes the right abstraction balance:
It leaves wear leveling to the device, which probably does know more about wear and device characteristics, and the interface makes the job of wear leveling more straightforward than the classic block interface.
A key-value interface would cover a significant part of what a file system does, and it seems to me that, after all these years, there is still enough going on in this area that we do not want to bake it into drive firmware.
Spunjji - Friday, August 7, 2020 - link
Everything up to the "Supporting Multiple Writers" section seemed pretty universally positive... and then it all got a bit hazy for me. Kinda seems like they introduced a whole new problem, there?

I guess if this isn't meant to go much further than enterprise hardware then it likely won't be much of an issue, but still, that's a pretty keen limitation.
Spunjji - Friday, August 7, 2020 - link
Great article, by the way. Realised I didn't mention that, but I really appreciate the perspective that's in-depth but not too-in-depth for the average tech-head 😁
AntonErtl - Saturday, August 8, 2020 - link
As long as a zone is not divided between file systems, or direct-access databases, it is natural that writes are synchronized and sequenced. And talking to the SSD through one NVMe/PCIe interface means that all writes (even to multiple zones) are sent to the drive in sequence.

OTOH, you have software and hardware with synchronous interfaces (which wait for some feedback before sending the next request), and in such a setting doing everything through one thread costs throughput.
So you can either design everything to work with asynchronous interfaces (e.g., SCSI tagged command queuing), at least at all single-thread levels, or you design synchronous interfaces that work with multiple threads. The "write it somewhere, and then tell where you wrote" approach seems to be along the latter lines. What's the status of asynchronous interfaces for NVMe?
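The "write it somewhere, and then tell where you wrote" approach is exactly what the ZNS zone-append command does: the drive, not the host, picks the landing LBA, so multiple writers never race on a shared write pointer. A minimal pure-Python simulation of that semantics (class and names are mine, for illustration only):

```python
import threading

class Zone:
    """Toy model of a ZNS zone exposing only an append command."""
    def __init__(self, start_lba: int, capacity_blocks: int):
        self.start = start_lba
        self.capacity = capacity_blocks
        self.write_pointer = 0
        self.lock = threading.Lock()  # stands in for the drive's internal ordering

    def append(self, n_blocks: int) -> int:
        """Append n_blocks of data; return the LBA the data actually landed at."""
        with self.lock:
            if self.write_pointer + n_blocks > self.capacity:
                raise IOError("zone full")
            lba = self.start + self.write_pointer
            self.write_pointer += n_blocks
            return lba

# Eight concurrent writers, no coordination between them -- each learns its
# LBA from the completion, the way zone append reports it back to the host.
zone = Zone(start_lba=0, capacity_blocks=1024)
results = []
threads = [threading.Thread(target=lambda: results.append(zone.append(4)))
           for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results))  # eight distinct 4-block extents: [0, 4, 8, ..., 28]
```

Regarding the question at the end: NVMe has been asynchronous from the start (deep submission/completion queue pairs), which is why zone append fits it so naturally - many appends can be in flight and each completion carries the assigned LBA.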
WorBlux - Wednesday, December 22, 2021 - link
> Kinda seems like they introduced a whole new problem, there?

Sort of. Many of these drives are meant to be used with the APIs directly accessible to your application, which means the application now has to solve problems that the OS hand-waved away.
If your application uses the filesystem API, only the filesystem has to worry about this. But if you want the application to leverage the determinism and parallelism available in ZNS drives, then it should be able to leverage the zone append command (which is the big advance over the ZBC API of SMR hard drives).
Steven Wells - Saturday, August 8, 2020 - link
So as a DRAM cost play this might save low single-digit percentages of the parts cost, which seems like not enough motivation on its own. So clearly most of the savings comes from reduced over-provisioning of the flash needed to get a similar write amplification, trading that against the extra lift required by the host. Curious if anyone has shared TCO studies on this to validate a clear cost savings for all the heavy lifting required by both the drive vendor and the data center customer.
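The over-provisioning side of that trade-off is easy to put on the back of an envelope. All the numbers below are my own illustrative assumptions (typical enterprise OP levels), not vendor or TCO-study figures:

```python
# Back-of-envelope: cost per *usable* TB as a function of over-provisioning.
# Illustrative assumptions only -- not vendor pricing or measured OP levels.
def usable_cost_per_tb(raw_cost_per_tb: float, op_fraction: float) -> float:
    """Raw flash cost divided by the fraction of capacity the host can use."""
    return raw_cost_per_tb / (1.0 - op_fraction)

conventional = usable_cost_per_tb(100.0, 0.28)  # ~28% OP, write-heavy drive
zns          = usable_cost_per_tb(100.0, 0.07)  # ZNS: host-side GC shrinks OP
saving = 1.0 - zns / conventional
print(f"{saving:.1%}")  # ~22.6% lower cost per usable TB under these assumptions
```

A saving in that range would dwarf the few percent available from dropping FTL DRAM, which matches the comment's intuition that over-provisioning is where the real money is.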
Thanks for the comprehensive write-up, Billy. It's great to see you writing about ZNS on Anandtech - I've followed it for 20 years! I haven't thought that my work on creating Open-Channel SSDs and Zoned Namespaces would one day be featured on Anandtech. Thanks for making it mainstream!umeshm - Monday, August 24, 2020 - link
This is the best explanation and analysis I have found on ZNS. Thank you, Billy!

I have a question about how 4KB LBAs are persisted when the flash page size is 16KB and 4 pages (QLC) are stored on each wordline.
You mentioned that a 4KB LBA is partially programmed into a flash page, but is vulnerable to read disturb. My understanding so far is that only SLC supports partial-page programming. So a 4KB LBA would need to be buffered in NVRAM within the SSD until there is a full page (16KB) worth of data to write. Even then, the wordline is only partially programmed, because the 3 other pages have not been programmed yet, so the wordline still needs to be protected through additional caching or buffering in NVRAM.
Could you or someone else please either confirm or correct my understanding?
Again, I really appreciate the effort and thinking that has gone into this article.
Umesh
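Umesh's buffering question can be made concrete with a toy model of the accumulation step: 4KB LBAs held in a buffer until a full 16KB page's worth exists. This is a sketch of my reading of the question, not of any real controller's behavior (the wordline-protection step is left out):

```python
PAGE = 16 * 1024   # flash page size, per the question
LBA = 4 * 1024     # host block size

class PageBuffer:
    """Accumulate 4KB LBA writes in (NV)RAM; flush only full 16KB pages."""
    def __init__(self):
        self.pending = []       # LBAs buffered, not yet on flash
        self.programmed = []    # full pages written to flash

    def write(self, lba: int):
        self.pending.append(lba)
        if len(self.pending) * LBA == PAGE:      # 4 LBAs = one full page
            self.programmed.append(tuple(self.pending))
            self.pending.clear()

buf = PageBuffer()
for lba in range(6):
    buf.write(lba)
print(buf.programmed)  # [(0, 1, 2, 3)] -- one full page programmed
print(buf.pending)     # [4, 5] -- still buffered, not yet durable on flash
```

Under this reading, the QLC wordline concern stacks on top: even flushed pages would stay vulnerable until all four pages sharing the wordline are programmed, which is presumably why controllers keep extra copies cached.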
weilinzhu - Monday, June 20, 2022 - link
Very helpful article, thanks a lot! As you wrote: "A recent Western Digital demo with a 512GB ZNS prototype SSD showed the drive using a zone size of 256MB (for 2047 zones total) but also supporting 2GB zones." Could you please tell me where that demo was published? Thanks in advance!