Advanced Flash Translation Layer
Jihong Kim
Dept. of CSE, SNU
Outline
• Problems of Hybrid-Mapping-Based FTL
• FTLs for Memory-Constrained Storage Systems
– DFTL
– μ-FTL
Hybrid FTL Schemes
• The main difficulty the FTL faces in providing high performance is the severely constrained size of SRAM (the two granularities are contrasted in the sketch below)
– Coarse-grained mapping (block-level mapping)
• Small SRAM requirement / poor garbage collection efficiency
– Fine-grained mapping (page-level mapping)
• Efficient garbage collection / large SRAM requirement
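The tradeoff can be made concrete with a minimal sketch (in Python, not from the slides); PAGES_PER_BLOCK and the table layouts are illustrative assumptions:

PAGES_PER_BLOCK = 64

def block_level_lookup(block_map, lpn):
    # One entry per logical block: the table is small, but updating any
    # page in place later forces a costly full-block merge.
    lbn, offset = divmod(lpn, PAGES_PER_BLOCK)
    return block_map[lbn] * PAGES_PER_BLOCK + offset

def page_level_lookup(page_map, lpn):
    # One entry per logical page: pages can be placed anywhere
    # (efficient GC), but the table is PAGES_PER_BLOCK times larger.
    return page_map[lpn]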
Problems of Hybrid FTL Schemes
• Fail to offer good performance for enterprise-scale workloads
• Require workload-specific tunable parameters
• Do not properly exploit the temporal locality of accesses
Basic Approaches to Memory-Constrained Storage Systems
• Cached mapping information
• On-demand loading of mapping information
– DFTL
• Better data structures for mapping information
– μ-FTL
DFTL
• Hybrid FTLs suffer performance degradation due to full merges
– Caused by the difference in mapping granularity between data and log blocks
– A high-performance FTL must be redesigned without log blocks
• DFTL is an enhanced form of the page-level FTL scheme
– Allows requests to be serviced from any physical page on flash
– All blocks can be used for servicing update requests
• How can the fine-grained mapping scheme be made feasible with the constrained SRAM size?
– Use an on-demand address translation mechanism
Demand-based Selective Caching of Page-level Address Mapping
• Proposes a novel FTL scheme (DFTL): a purely page-mapped FTL
– Exploits temporal locality of accesses
– Uses the limited SRAM to store the most popular mappings while the rest are maintained on flash
– Provides an easier-to-implement solution
– Devoid of tunable parameters
DFTL Architecture
Data Blocks and Translation Blocks
• DFTL partitions all blocks into two groups
– Data blocks: composed of data pages
• Each data page contains the real data
– Translation blocks: composed of translation pages
• Each translation page stores logical-to-physical mapping information
• Mapping information for logically consecutive pages is stored on a single page
• 512 logically consecutive mappings fit in a single page (page size: 2 KB, address: 4 bytes; see the arithmetic below)
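The 512-entry figure follows directly from the stated sizes; a small sketch showing how a logical page number (DLPN) splits into a virtual translation page number (MVPN) and a slot (function names are illustrative):

PAGE_SIZE = 2048              # bytes per flash page
ENTRY_SIZE = 4                # bytes per physical page address
ENTRIES_PER_TRANSLATION_PAGE = PAGE_SIZE // ENTRY_SIZE   # = 512

def locate_mapping(dlpn):
    # A logical page's mapping lives at a fixed slot of a fixed virtual
    # translation page, so only the small MVPN -> MPPN directory
    # (the GTD) must be kept in SRAM.
    mvpn, slot = divmod(dlpn, ENTRIES_PER_TRANSLATION_PAGE)
    return mvpn, slot

assert locate_mapping(1280) == (2, 256)   # matches the example below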
Example: When a Request Incurs a CMT Miss
(Figure: a request for DLPN = 1280 misses in the Cached Mapping Table; a victim entry is evicted and written back to its translation page, and the Global Translation Directory entry for MVPN = 2 locates the translation page whose slot yields the mapping DLPN 1280 → DPPN 660; a code sketch of this miss path follows)
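A minimal sketch of this miss path, assuming a tiny in-memory model of flash (cmt, gtd, flash, and the toy allocator are illustrative; gtd/flash must be pre-populated, and real DFTL also skips the write-back when the victim is clean):

from collections import OrderedDict

ENTRIES = 512
CMT_CAPACITY = 2
cmt = OrderedDict()          # Cached Mapping Table: DLPN -> DPPN, LRU order
gtd = {}                     # Global Translation Directory: MVPN -> MPPN
flash = {}                   # MPPN -> translation page (list of 512 DPPNs)
next_mppn = 1000             # next free translation page (toy allocator)

def translate(dlpn):
    global next_mppn
    if dlpn in cmt:
        cmt.move_to_end(dlpn)                 # CMT hit: no flash access
        return cmt[dlpn]
    if len(cmt) >= CMT_CAPACITY:              # miss with a full CMT
        victim, dppn = cmt.popitem(last=False)
        mvpn, slot = divmod(victim, ENTRIES)
        page = list(flash[gtd[mvpn]])         # read #1: victim's translation page
        page[slot] = dppn                     # (batching other dirty entries of
        flash[next_mppn] = page               #  the same MVPN would go here)
        gtd[mvpn] = next_mppn                 # write #1: out-of-place update
        next_mppn += 1
    mvpn, slot = divmod(dlpn, ENTRIES)
    cmt[dlpn] = flash[gtd[mvpn]][slot]        # read #2: requested page
    return cmt[dlpn]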
Overhead in DFTL Address Translation
• The worst-case overhead in DFTL address translation
– Two translation page reads
• One for the victim by the replacement policy
• The other for the original requests
– One translation page write
• For the translation page write for the victim
• The address translation overhead can be mitigated
– Temporal locality reduces the number of evictions
– Batch updates for entries co-located in the victim's translation page reduce the number of translation page writes
Read/Write Operation
• For a read operation
– Directly serviced through flash page read operation once the address translation is completed
• For a write operation
– Maintains two types of current blocks, one for data pages and one for translation pages
• Current data block and current translation block
– Sequentially writes the given data into these blocks (sketched below)
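A sketch of this out-of-place write path; the free-block "allocator" and the flat structures are stand-ins for illustration:

PAGES_PER_BLOCK = 64
flash_pages = {}                              # DPPN -> data (toy flash)
page_map = {}                                 # DLPN -> DPPN (via CMT in DFTL)
invalid = set()                               # stale DPPNs awaiting GC
current = {"pbn": 0, "next": 0}               # current data block + append point

def write_page(dlpn, data):
    if current["next"] == PAGES_PER_BLOCK:    # current block full:
        current["pbn"] += 1                   # take a fresh block (stand-in
        current["next"] = 0                   # for a real free-block allocator)
    dppn = current["pbn"] * PAGES_PER_BLOCK + current["next"]
    current["next"] += 1                      # strictly sequential programming
    flash_pages[dppn] = data
    if dlpn in page_map:
        invalid.add(page_map[dlpn])           # old copy becomes garbage
    page_map[dlpn] = dppn                     # mapping update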
Garbage Collection
• Different steps are followed depending on the type of the victim block (both cases are sketched below)
– Translation block:
• Copy the valid pages to the current translation block
• Update the corresponding GTD
– Data block:
• Copy the valid pages to the current data block
• Update all translation pages and CMT entries associated with these pages
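A hedged sketch of the two GC cases; append_to_*, update_translation_page, and erase_block are assumed helpers standing in for the mechanisms above:

def garbage_collect(victim, gtd, cmt):
    if victim.kind == "translation":
        for mvpn, page in victim.valid_pages():
            gtd[mvpn] = append_to_current_translation_block(page)
            # one GTD update per copied translation page
    else:                                     # data block
        for dlpn, data in victim.valid_pages():
            new_dppn = append_to_current_data_block(data)
            update_translation_page(dlpn, new_dppn)   # dirties a translation page
            if dlpn in cmt:
                cmt[dlpn] = new_dppn          # keep cached mapping consistent
    erase_block(victim)                       # victim returns to the free pool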
Example: Translation Block
• Translation block as victim for garbage collection
Example: Data Block
• Data block as victim for garbage collection
Evaluation Setup
Parameters
− Flash memory size: 32 GB / SRAM size: 2 MB
− Log buffer size: 512 MB (about 3% of the total flash capacity)
− Evaluated schemes: FAST, baseline, DFTL
Workloads
Performance metrics
− Garbage collection’s efficacy
− Response time (device service time + queuing delay)
The Number of Block Merges
− Baseline and DFTL show a higher number of switch merges
− FAST incurs a large number of full merges
Address Translation Overhead
• DFTL incurs extra overhead due to its translation mechanism
– Address translation accounts for 90% of the extra overhead
• DFTL yields a 3-fold reduction in extra operations over FAST
– 63% of address translations hit in SRAM
(Figures: block erases and extra read/write operations)
Impact of SRAM size
With the SRAM size approaching the working set size
− DFTL’s performance becomes comparable to Baseline (= page-level FTL)
μ-FTL
• Design goals
– Reduce RAM usage to the level of block-mapped FTLs
– Provide performance comparable to page-mapped FTLs
• Key observation: two dominant write patterns
– Coarse-grained mapping suits large, sequential writes
– Fine-grained mapping suits small, random writes
• μ-FTL adjusts mapping granularities according to the size of incoming write requests (sketched below)
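A sketch of the request-size-driven granularity, where each write becomes one variable-sized extent; split_overlapping_extents is an assumed helper that trims or removes older extents the new one shadows:

def on_write(extents, start_lpn, start_ppn, length):
    # A large sequential write costs a single coarse entry, while each
    # small random write costs one small entry -- mapping granularity
    # therefore tracks the workload automatically.
    split_overlapping_extents(extents, start_lpn, length)  # assumed helper
    extents[start_lpn] = (start_ppn, length)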
μ-FTL
– Page-mapped FTL: low overhead, large RAM usage
– Block-mapped FTL: high overhead, small RAM usage
– μ-FTL: low overhead, small RAM usage
Address Mapping
• μ-FTL implements multiple mapping granularities
– Extent: A variable-sized mapping entry (a pair of key and record)
(Figure: μ-FTL mapping example — logical pages 0–7 mapped onto physical pages 100–107 by four extents; lookup is sketched below)
Key (start LPN)   Record (start PPN, length)
0                 (100, 4)
4                 (105, 1)
5                 (104, 1)
6                 (106, 2)
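Address translation then finds the last extent starting at or before the target LPN; a runnable sketch against the example above (a sorted dict stands in for the on-flash μ-tree):

import bisect

def lookup(extents, lpn):
    keys = sorted(extents)                       # extent start LPNs
    start = keys[bisect.bisect_right(keys, lpn) - 1]
    ppn, length = extents[start]
    assert lpn < start + length, "lpn not mapped"
    return ppn + (lpn - start)

extents = {0: (100, 4), 4: (105, 1), 5: (104, 1), 6: (106, 2)}
assert lookup(extents, 2) == 102   # inside extent 0 -> (100, 4)
assert lookup(extents, 7) == 107   # inside extent 6 -> (106, 2)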
Garbage Collection
• Victim selection policy
– Select the block with the largest number of invalid pages (sketched below)
• Garbage collection information
– Per-block invalid page counter
• Maintaining the number of invalid pages in each physical block
– Bitmap information
• Distinguishing valid pages from invalid pages
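Both pieces of GC metadata fit in a few lines (a sketch; the 4-pages-per-block geometry matches the toy figures here, real blocks are larger):

def select_victim(invalid_count):
    # invalid_count: {pbn: number of invalid pages}; greedy policy
    return max(invalid_count, key=invalid_count.get)

def valid_page_offsets(bitmap, pages_per_block=4):
    # bit i set => page i of the block is still valid and must be copied
    return [i for i in range(pages_per_block) if bitmap >> i & 1]

assert select_victim({25: 3, 26: 0}) == 25
assert valid_page_offsets(0b1000) == [3]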
Bitmap Information
• Bitmap information is too large to be stored entirely in RAM
• μ-FTL therefore stores the bitmap information in the μ-tree
(Figure: bitmap entries in the μ-tree — with four pages per block in this example, physical pages 100–103 and 104–107 belong to blocks 25 and 26, whose bitmap records are (1000)₂ and (1111)₂; mapping entries and bitmap entries coexist in the same tree under separate key spaces, as in the sketch below)
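A sketch of invalidation through the shared tree; BITMAP_KEY_BASE and the all-valid default are assumptions used to keep bitmap keys apart from mapping keys:

PAGES_PER_BLOCK = 4
BITMAP_KEY_BASE = 1 << 32      # assumed: separates bitmap keys from mapping keys

def mark_invalid(tree, ppn):
    pbn = ppn // PAGES_PER_BLOCK
    key = BITMAP_KEY_BASE + pbn
    bitmap = tree.get(key, (1 << PAGES_PER_BLOCK) - 1)    # assume all valid
    tree[key] = bitmap & ~(1 << (ppn % PAGES_PER_BLOCK))  # one μ-tree update

tree = {}
mark_invalid(tree, 101)        # page 101 = block 25, offset 1
assert tree[BITMAP_KEY_BASE + 25] == 0b1101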
Key Data Structure: μ-Tree
• Handles any insertion/deletion/update operation of the tree with only one flash write operation
– By storing multiple nodes, from the leaf to the root, in a single page
(Figure: μ-Tree cache operation — tree nodes A–F are cached in RAM in LRU order; a cache miss loads nodes from flash, and when the cache is full only dirty pages are flushed; updating leaf F creates new versions F', C', A' along the path to the root, written together as sketched below)
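A sketch of the single-write property: updating a leaf re-creates only the nodes on its root path, and all of them are packed into one flash page (Node and the path handling are illustrative, not the paper's code):

class Node:
    def __init__(self, entries, children=()):
        self.entries, self.children = entries, list(children)

def update_leaf(path, new_entries):
    # path = [root, ..., leaf]; returns the new root plus the nodes
    # (leaf first) that a single flash page program would persist.
    new = Node(new_entries)                        # F -> F'
    page = [new]
    for i in range(len(path) - 2, -1, -1):         # parents, leaf-to-root
        parent, old_child = path[i], path[i + 1]
        kids = [new if c is old_child else c for c in parent.children]
        new = Node(parent.entries, kids)           # C -> C', A -> A'
        page.append(new)
    return new, page                               # new root A', one-page payload

E, F = Node("E"), Node("F")
C = Node("C", [E, F]); B = Node("B")
A = Node("A", [B, C])
root, page = update_leaf([A, C, F], "F'")
assert len(page) == 3                              # F', C', A' share one page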
Logical Address Space Partitioning
• Separating hot data from cold data has a beneficial effect on the performance of an FTL
• Each part of the logical address space exhibits a different degree of "hotness"
• Update block
– The block in charge of receiving data from incoming write requests
(Figure: a single global update block vs. one update block per partition — with per-partition update blocks, hot and cold regions of the logical address space no longer share one update block; partition selection is sketched below)
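A sketch of partition selection; the 256 MB size matches the standard configuration later in these slides, and update_blocks is an assumed per-partition table:

PAGE_SIZE = 2048
PARTITION_SIZE = 256 * 1024 * 1024
PAGES_PER_PARTITION = PARTITION_SIZE // PAGE_SIZE

def update_block_for(lpn, update_blocks):
    # Hot and cold regions fall into different partitions, so their
    # writes no longer share (and pollute) a single global update block.
    return update_blocks[lpn // PAGES_PER_PARTITION]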
μ-FTL Design Summary
(Figure: μ-FTL design summary — incoming writes update the data address information and bitmaps through the μ-Tree cache and the bitmap cache, backed by the μ-tree on flash; the logical address space is divided into partitions (Partition 1–4); dirty cache contents are flushed when a cache fills, and entries are fetched from flash on a cache miss)
Evaluation Methodology
• Trace-driven simulators for several FTLs
– Block-mapped: the log block scheme, FAST, the super block scheme
– Page-mapped: DAC
• 6 traces
– 3 multimedia traces: PIC(8GB), MP3(8GB), MOV(8GB)
– 3 PC traces: WEB(32GB), GENERAL(32GB), SYSMARK(40GB)
The standard configuration
− Partition size: 256 MB (512 logical blocks)
− Bitmap cache size: 4 KB for multimedia traces, 32 KB for PC traces
− Portion of extra blocks: 3%
The RAM Usage
Capacity   Block-mapped FTL   μ-FTL (A+B+C)                     Page-mapped FTL
8 GB       64 KB              16 KB + 44 KB + 4 KB = 64 KB      8 MB
32 GB      256 KB             64 KB + 160 KB + 32 KB = 256 KB   32 MB
40 GB      320 KB             80 KB + 208 KB + 32 KB = 320 KB   40 MB
(A) Per-block invalid page counters (B) The μ-Tree cache (C) The bitmap cache
The Comparison of FTL Performance
(Figure: overhead in seconds as the portion of extra blocks varies from 1% to 5%, for GENERAL (32 GB) and SYSMARK (40 GB); the annotated improvements are 56.8% and 89.7%)
The Effect of Partitioning
(Figure: overhead broken down into GC, mapping entry, and bitmap entry costs as the partition size varies from 32768 MB down to 32 MB, for WEB (32 GB) and GENERAL (32 GB); the annotated improvements are 23.6% and 39.6%)
Reference
• Aayush Gupta et al., “DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings,” ASPLOS, 2009
• Yong-Goo Lee et al., “μ-FTL: A Memory-Efficient Flash Translation Layer Supporting Multiple Mapping Granularities,” EMSOFT, 2008
• Dongwon Kang et al., “μ-Tree: An Ordered Index Structure for NAND Flash Memory,” EMSOFT, 2007