2. An Overview of Cache

Before I describe a basic cache model, I need to explain what cache is. Cache is small, high-speed memory, usually Static RAM (SRAM), that contains the most recently accessed pieces of main memory.

Why is this high-speed memory necessary or beneficial? In today's systems, the time it takes to bring an instruction (or piece of data) into the processor is very long when compared to the time to execute the instruction. For example, a typical access time for DRAM is 60 ns. A 100 MHz processor can execute most instructions in 1 CLK, or 10 ns. Therefore a bottleneck forms at the input to the processor. Cache memory helps by decreasing the time it takes to move information to and from the processor. A typical access time for SRAM is 15 ns. Therefore cache memory allows small portions of main memory to be accessed 3 to 4 times faster than DRAM (main memory).

How can such a small piece of high-speed memory improve system performance? The theory that explains this performance is called "Locality of Reference." The concept is that at any given time the processor will be accessing memory in a small or localized region of memory. The cache loads this region, allowing the processor to access the memory region faster. How well does this work? In a typical application, the internal 16K-byte cache of a Pentium(R) processor contains over 90% of the addresses requested by the processor. This means that over 90% of the memory accesses occur out of the high-speed cache.[1]

So now the question: why not replace main memory DRAM with SRAM? The main reason is cost. SRAM is several times more expensive than DRAM. Also, SRAM consumes more power and is less dense than DRAM. Now that the reason for cache has been established, let's look at a simplified model of a cache system.

[1] Cache performance is directly related to the applications that are run in the system. This number is based on running standard desktop applications. It is possible to have lower performance depending on how the software is designed. Performance will not be discussed in this paper.
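The "3 to 4 times faster" claim can be checked with a quick back-of-the-envelope calculation using the numbers above (60 ns DRAM, 15 ns SRAM, a roughly 90% hit rate). This sketch is mine, not from the original text:

```python
# Rough sketch of the effective-access-time argument above, using the
# figures given in the text: 60 ns DRAM, 15 ns SRAM, ~90% hit rate.
DRAM_NS = 60
SRAM_NS = 15
HIT_RATE = 0.90

# Hits are served from fast SRAM; misses fall through to DRAM.
effective_ns = HIT_RATE * SRAM_NS + (1 - HIT_RATE) * DRAM_NS

print(f"effective access time: {effective_ns:.1f} ns")      # 19.5 ns
print(f"speedup vs. DRAM alone: {DRAM_NS / effective_ns:.1f}x")
```

At a 90% hit rate this works out to about a 3x improvement over DRAM alone, consistent with the "3 to 4 times" figure quoted for raw SRAM access.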
2.1 Basic Model

[Figure 2-1 Basic Cache Model: CPU, cache memory, and main DRAM memory connected by the system interface]

Figure 2-1 shows a simplified diagram of a system with cache. In this system, every time the CPU performs a read or write, the cache may intercept the bus transaction, allowing the cache to decrease the response time of the system. Before discussing this cache model, let's define some of the common terms used when talking about cache.

2.1.1 Cache Hits
When the cache contains the information requested, the transaction is said to be a cache hit.

2.1.2 Cache Miss
When the cache does not contain the information requested, the transaction is said to be a cache miss.

2.1.3 Cache Consistency
Since cache is a photo or copy of a small piece of main memory, it is important that the cache always reflects what is in main memory. Some common terms used to describe the process of maintaining cache consistency are:

2.1.3.1 Snoop
When a cache is watching the address lines for transactions, this is called a snoop. This function allows the cache to see if any transactions are accessing memory it contains within itself.

2.1.3.2 Snarf
When a cache takes the information from the data lines, the cache is said to have snarfed the data. This function allows the cache to be updated and maintain consistency.

Snoop and snarf are the mechanisms the cache uses to maintain consistency. Two other terms are commonly used to describe the inconsistencies in the cache data; these terms are:

2.1.3.3 Dirty Data
When data is modified within the cache but not modified in main memory, the data in the cache is called "dirty data."

2.1.3.4 Stale Data
When data is modified within main memory but not modified in the cache, the data in the cache is called "stale data."
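The snoop/snarf mechanism can be sketched with a toy model (entirely hypothetical; the dictionaries and function names below are illustrative, not from the text): a cache holds copies of a few main-memory addresses, watches other bus masters' writes, and snarfs the written data so its copies do not go stale.

```python
# Hypothetical toy model of the terms above. The cache holds copies of
# some main-memory addresses; a snoop watches other bus masters' writes,
# and a snarf copies the written data to keep the cache consistent.
main_memory = {0x1000: 42, 0x2000: 7}
cache = {0x1000: 42}                     # cached copy of 0x1000 only

def is_hit(addr):
    # Cache hit: the cache contains the requested address.
    return addr in cache

def bus_write(addr, value):
    # Another bus master writes main memory.
    main_memory[addr] = value
    # Snoop: the cache watches the address lines for this transaction.
    if addr in cache:
        # Snarf: take the data off the data lines so the cached copy
        # does not become stale.
        cache[addr] = value

print(is_hit(0x1000))   # True  (cache hit)
print(is_hit(0x2000))   # False (cache miss)
bus_write(0x1000, 99)
print(cache[0x1000])    # 99 -- snarfed, so the copy is not stale
```

Without the snarf step, the cached copy of 0x1000 would still read 42 after the bus write, which is exactly the "stale data" case defined above.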
Now that we have some names for cache functions, let's see how caches are designed and how this affects their function.

2.2 Cache Architecture

Caches have two characteristics: a read architecture and a write policy. The read architecture may be either "Look Aside" or "Look Through." The write policy may be either "Write-Back" or "Write-Through." Both types of read architectures may have either type of write policy, depending on the design. Write policies will be described in more detail in the next section. Let's examine the read architecture now.

2.2.1 Read Architecture: Look Aside

[Figure 2-2 Look Aside Cache: CPU, SRAM, Tag RAM, and cache controller sitting in parallel with main memory on the system interface]

Figure 2-2 shows a simple diagram of the "look aside" cache architecture. In this diagram, main memory is located opposite the system interface. The discerning feature of this cache unit is that it sits in parallel with main memory. It is important to notice that both the main memory and the cache see a bus cycle at the same time. Hence the name "look aside."

2.2.1.1 Look Aside Cache Example
When the processor starts a read cycle, the cache checks to see if that address is a cache hit.

HIT: If the cache contains the memory location, then the cache will respond to the read cycle and terminate the bus cycle.

MISS: If the cache does not contain the memory location, then main memory will respond to the processor and terminate the bus cycle. The cache will snarf the data, so the next time the processor requests this data it will be a cache hit.

Look aside caches are less complex, which makes them less expensive. This architecture also provides a better response to a cache miss, since both the DRAM and the cache see the bus cycle at the same time. The drawback is that the processor cannot access the cache while another bus master is accessing main memory.
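The HIT/MISS flow of a read cycle can be sketched as follows (a hypothetical model; the memory contents and function are illustrative, not from the text):

```python
# Hypothetical sketch of a read cycle with snarf-on-miss. On a hit the
# cache responds and terminates the cycle; on a miss main memory
# responds and the cache snarfs the data so the next access hits.
main_memory = {addr: addr * 2 for addr in range(8)}   # made-up contents
cache = {}

def read(addr):
    if addr in cache:                  # HIT: cache responds
        return cache[addr], "hit"
    value = main_memory[addr]          # MISS: main memory responds
    cache[addr] = value                # cache snarfs the data
    return value, "miss"

print(read(3))   # (6, 'miss') -- first access misses
print(read(3))   # (6, 'hit')  -- the snarfed copy now hits
```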
2.2.2 Read Architecture: Look Through

[Figure 2-3 Look Through Cache: CPU, SRAM, Tag RAM, and cache controller sitting between the processor and the system interface]

Figure 2-3 shows a simple diagram of the look through cache architecture. Again, main memory is located opposite the system interface. The discerning feature of this cache unit is that it sits between the processor and main memory. It is important to notice that the cache sees the processor's bus cycle before allowing it to pass on to the system bus.

2.2.2.1 Look Through Read Cycle Example
When the processor starts a memory access, the cache checks to see if that address is a cache hit.

HIT: The cache responds to the processor's request without starting an access to main memory.

MISS: The cache passes the bus cycle onto the system bus. Main memory then responds to the processor's request. The cache snarfs the data so that the next time the processor requests this data, it will be a cache hit.

This architecture allows the processor to run out of cache while another bus master is accessing main memory, since the processor is isolated from the rest of the system. However, this cache architecture is more complex because it must be able to control accesses to the rest of the system. The increase in complexity increases the cost. Another downside is that memory accesses on cache misses are slower, because main memory is not accessed until after the cache is checked. This is not an issue if the cache has a high hit rate and there are other bus masters.

2.2.3 Write Policy

A write policy determines how the cache deals with a write cycle. The two common write policies are Write-Back and Write-Through.

With a Write-Back policy, the cache acts like a buffer. That is, when the processor starts a write cycle, the cache receives the data and terminates the cycle. The cache then writes the data back to main memory when the system bus is available. This method provides the greatest performance by allowing the processor to continue its tasks while main memory is updated at a later time.
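The two write policies can be sketched side by side (a hypothetical model of my own; the class and its fields are illustrative, not from the text):

```python
# Hypothetical sketch of the two write policies. Write-through pushes
# every write to main memory immediately; write-back only marks the
# cached line dirty and updates memory later, when the bus is free.
class Cache:
    def __init__(self, write_back):
        self.write_back = write_back
        self.lines = {}      # addr -> cached value
        self.dirty = set()   # addrs modified in cache but not memory
        self.memory = {}     # stands in for main DRAM memory

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_back:
            self.dirty.add(addr)       # memory is updated later
        else:
            self.memory[addr] = value  # write through to memory now

    def flush(self):
        # A write-back cache updates memory when the bus is available.
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()

wb = Cache(write_back=True)
wb.write(0x10, 5)
print(0x10 in wb.memory)   # False -- not yet written back (dirty data)
wb.flush()
print(wb.memory[0x10])     # 5

wt = Cache(write_back=False)
wt.write(0x10, 5)
print(wt.memory[0x10])     # 5 -- memory updated immediately
```

Note how the write-back cache briefly holds dirty data, which is why it needs the extra control logic the text describes.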
However, controlling writes to main memory increases the cache's complexity and cost.

The second method is the Write-Through policy. As the name implies, the processor writes through the cache to main memory. The cache may update its contents; however, the write cycle does not end until the data is stored in main memory. This method is less complex and
therefore less expensive to implement. The performance with a Write-Through policy is lower, since the processor must wait for main memory to accept the data.

2.3 Cache Components

The cache subsystem can be divided into three functional blocks: SRAM, Tag RAM, and the Cache Controller. In actual designs, these blocks may be implemented by multiple chips, or all may be combined into a single chip.

2.3.1 SRAM
Static Random Access Memory (SRAM) is the memory block which holds the data. The size of the SRAM determines the size of the cache.

2.3.2 Tag RAM
Tag RAM (TRAM) is a small piece of SRAM that stores the addresses of the data that is stored in the SRAM.

2.3.3 Cache Controller
The cache controller is the brains behind the cache. Its responsibilities include performing the snoops and snarfs, updating the SRAM and TRAM, and implementing the write policy. The cache controller is also responsible for determining if a memory request is cacheable[2] and if a request is a cache hit or miss.

2.4 Cache Organization

[Figure 2-4 Cache Page: main memory divided into cache pages, each page divided into cache lines]

In order to fully understand how caches can be organized, two terms need to be defined. These terms are cache page and cache line. Let's start by defining a cache page. Main memory is divided into equal pieces called cache pages[3]. The size of a page is dependent on the size of the cache and how the cache is organized.

[2] It is not desirable to have all memory cacheable. Which regions of main memory are non-cacheable depends on the design. For example, in a PC platform the video region of main memory is not cacheable.
[3] A cache page is not associated with a memory page in page mode. The word page has several different meanings when referring to a PC architecture.

A cache page is broken into smaller pieces, each
called a cache line. The size of a cache line is determined by both the processor and the cache design. Figure 2-4 shows how main memory can be broken into cache pages and how each cache page is divided into cache lines. We will discuss cache organizations and how to determine the size of a cache page in the following sections.

2.4.1 Fully-Associative

[Figure 2-5 Fully-Associative Cache: any line of main memory may be stored in any line of cache memory]

The first cache organization to be discussed is the Fully-Associative cache. Figure 2-5 shows a diagram of a Fully-Associative cache. This organizational scheme allows any line in main memory to be stored at any location in the cache. Fully-Associative cache does not use cache pages, only lines. Main memory and cache memory are both divided into lines of equal size. For example, Figure 2-5 shows that Line 1 of main memory is stored in Line 0 of cache. However, this is not the only possibility: Line 1 could have been stored anywhere within the cache. Any cache line may store any memory line, hence the name Fully-Associative.

A Fully-Associative scheme provides the best performance, because any memory location can be stored at any cache location. The disadvantage is the complexity of implementing this scheme. The complexity comes from having to determine if the requested data is present in the cache. In order to meet the timing requirements, the current address must be compared with all the addresses present in the TRAM. This requires a very large number of comparators, which increases the complexity and cost of implementing large caches. Therefore, this type of cache is usually only used for small caches, typically less than 4K.
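The fully-associative lookup can be sketched as follows (hypothetical; in real hardware the comparisons run in parallel, one comparator per TRAM entry, while this sketch models them as a loop):

```python
# Hypothetical sketch of a fully-associative lookup. Every TRAM entry
# is compared against the requested line address; the loop stands in
# for the bank of parallel comparators described above.
LINE_SIZE = 32                        # bytes per line (illustrative)
tram = [0x1000, 0x2040, 0x80A0]       # line addresses currently cached

def lookup(addr):
    line_addr = addr - (addr % LINE_SIZE)    # address of containing line
    for slot, tag in enumerate(tram):        # one comparator per entry
        if tag == line_addr:
            return slot                      # hit: which cache line
    return None                              # miss

print(lookup(0x2040 + 5))   # 1    -- hit in slot 1
print(lookup(0x3000))       # None -- miss
```

Because every entry must be checked, the comparator count grows with the number of cache lines, which is exactly why this scheme stays practical only for small caches.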
2.4.2 Direct Mapped

[Figure 2-6 Direct Mapped Cache: Line n of any main memory page maps to Line n of cache memory]

Direct Mapped cache is also referred to as 1-Way set-associative cache. Figure 2-6 shows a diagram of a direct mapped scheme. In this scheme, main memory is divided into cache pages. The size of each page is equal to the size of the cache. Unlike the fully-associative cache, the direct mapped cache may only store a specific line of memory within the same line of cache. For example, Line 0 of any page in memory must be stored in Line 0 of cache memory. Therefore, if Line 0 of Page 0 is stored within the cache and Line 0 of Page 1 is requested, then Line 0 of Page 0 will be replaced with Line 0 of Page 1. This scheme directly maps a memory line into an equivalent cache line, hence the name Direct Mapped cache.

A Direct Mapped cache scheme is the least complex of all three caching schemes. Direct Mapped cache requires that the current requested address be compared with only one cache address. Since this implementation is less complex, it is far less expensive than the other caching schemes. The disadvantage is that Direct Mapped cache is far less flexible, making the performance much lower, especially when jumping between cache pages.

2.4.3 Set Associative

[Figure 2-7 2-Way Set-Associative Cache: cache memory divided into Way 0 and Way 1, each holding Line 0 through Line n]

A Set-Associative cache scheme is a combination of the Fully-Associative and Direct Mapped caching schemes. A set-associative scheme works by dividing the cache SRAM into equal sections (typically 2 or 4) called cache ways. The cache page size is equal to the size of a cache way. Each cache way is treated like a small direct mapped cache. To make the explanation clearer, let's look at a specific example. Figure 2-7 shows a diagram of a 2-Way Set-Associative cache scheme. In this scheme, two lines of memory that map to the same cache line may be stored at any time. This helps to reduce the number of times cache line data is written over.

This scheme is less complex than a Fully-Associative cache because the number of comparators is equal to the number of cache ways. A 2-Way Set-Associative cache only requires two comparators, making this scheme less expensive than a fully-associative scheme.

3. The Pentium(R) Processor's Cache

This section examines the internal cache on the Pentium(R) processor. The purpose of this section is to describe the cache scheme that the Pentium(R) processor uses and to provide an overview of how the Pentium(R) processor maintains cache consistency within a system.

The above section broke cache into neat little categories. However, in actual implementations, cache is often a combination of all the above-mentioned categories. The concepts are the same; only the boundaries are different.

Pentium(R) processor cache is implemented differently than the systems shown in the previous examples. The first difference is that the cache system is internal to the processor, i.e., integrated into the part. Therefore, no external hardware is needed to take advantage of this cache[4], helping to reduce the overall cost of the system. Another advantage is the speed of memory request responses. For example, a 100 MHz Pentium(R) processor has an external bus speed of 66 MHz. All external cache must operate at a maximum speed of 66 MHz. However, the internal cache operates at 100 MHz. Not only does the internal cache respond faster, it also has a wider data interface. The external interface is only 64 bits wide, while the internal interface between the cache and the processor prefetch buffer is 256 bits wide. Therefore, a huge increase in performance is possible by integrating the cache into the CPU.

A third difference is that the cache is divided into two separate pieces to improve performance: a data cache and a code cache, each 8K in size.
This division allows both code and data to readily cross page boundaries without having to overwrite one another.

[4] In order for the Pentium(R) processor to know if an address is cacheable, KEN# must be asserted. For more detailed information regarding the KEN# signal or other Pentium(R) processor signals, please refer to the Pentium Processor Family Developer's Manual Volume 1: Pentium(R) Processors, order number 241428. (way cool!)
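The 2-way set-associative lookup from section 2.4.3 can be sketched as follows (a hypothetical model of my own; the sizes, replacement choice, and function names are illustrative, not from the text):

```python
# Hypothetical sketch of a 2-way set-associative lookup. Each way is a
# small direct-mapped cache: a line's position within its page picks
# the cache line, and one comparator per way checks the stored page.
LINE_SIZE = 32            # bytes per line (illustrative)
LINES_PER_WAY = 128       # so each way/page is 4K (illustrative)

# ways[w][line_index] holds the page number cached there, or None.
ways = [[None] * LINES_PER_WAY for _ in range(2)]

def access(addr):
    line_index = (addr // LINE_SIZE) % LINES_PER_WAY
    page = addr // (LINE_SIZE * LINES_PER_WAY)
    for way in ways:                       # one comparator per way
        if way[line_index] == page:
            return "hit"
    # Miss: fill a free way if one exists, else replace way 0
    # (a deliberately simplistic replacement choice).
    for way in ways:
        if way[line_index] is None:
            way[line_index] = page
            return "miss (filled)"
    ways[0][line_index] = page
    return "miss (replaced)"

# Two addresses with the same line index but different pages can
# coexist -- the case where a direct-mapped cache would thrash.
print(access(0x0000))   # miss (filled) -- page 0, line 0, way 0
print(access(0x1000))   # miss (filled) -- page 1, line 0, way 1
print(access(0x0000))   # hit
print(access(0x1000))   # hit
```

With only one way this degenerates into the direct-mapped scheme of section 2.4.2, where the second access to 0x0000 would have missed.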
[Figure 3-1 Pentium(R) Processor with L2 Cache: CPU with internal L1 cache, external L2 cache, and main DRAM memory on the system interface]

When developing a system with a Pentium(R) processor, it is common to add an external cache. The external cache is the second cache in a Pentium(R) processor system, so it is called a Level 2 (or L2) cache. The internal processor cache is referred to as a Level 1 (or L1) cache. The names L1 and L2 do not depend on where the cache is physically located (i.e., internal or external). Rather, they depend on what is first accessed by the processor (i.e., the L1 cache is accessed before the L2 whenever a memory request is generated). Figure 3-1 shows how the L1 and L2 caches relate to each other in a Pentium(R) processor system.

3.1 Cache Organization

[Figure 3-2 Internal Pentium(R) Processor Cache Scheme: 2-way set-associative, Way 0 and Way 1 each holding Line 0 through Line 127]

Both caches are 2-way set-associative in structure. The cache line size is 32 bytes, or 256 bits. A cache line is filled by a burst of four reads on the processor's 64-bit data bus. Each cache way contains 128 cache lines. The cache page size is 4K, or 128 lines. Figure 3-2 shows a diagram of the 2-way set-associative scheme with the line numbers filled in.

3.2 Operating Modes

Unlike the cache systems discussed in section 2 (An Overview of Cache), the write policy on the Pentium(R) processor allows the software to control how the cache will function. The bits that control the cache are the CD (cache disable) and NW (not write-through) bits. As the name
suggests, the CD bit allows the user to disable the Pentium(R) processor's internal cache. When CD = 1, the cache is disabled; when CD = 0, the cache is enabled. The NW bit allows the cache to be either write-through (NW = 0) or write-back (NW = 1).

3.3 Cache Consistency

The Pentium(R) processor maintains cache consistency with the MESI[5] protocol. MESI is used to allow the cache to decide if a memory entry should be updated or invalidated. With the Pentium(R) processor, two functions are performed to allow its internal cache to stay consistent: snoop cycles and cache flushing.

The Pentium(R) processor snoops during memory transactions on the system bus[6]. That is, when another bus master performs a write, the Pentium(R) processor snoops the address. If the Pentium(R) processor contains the data, the processor will schedule a write-back.

Cache flushing is the mechanism by which the Pentium(R) processor clears its cache. A cache flush may result from actions in either hardware or software. During a cache flush, the Pentium(R) processor writes back all modified (or dirty) data. It then invalidates its cache (i.e., makes all cache lines unavailable). After the Pentium(R) processor finishes its write-backs, it generates a special bus cycle called the Flush Acknowledge Cycle. This signal allows lower level caches, e.g. L2 caches, to flush their contents as well.

4. Conclusion

Caches are implemented in a variety of ways, though the basic concepts of cache are the same. As shown above in the overview of cache, most systems have the major pieces: SRAM, TRAM, and a controller. However, you can see that the details of the Pentium(R) processor's implementation cross several of the initially defined boundaries. It is important to understand that the Pentium(R) processor uses only one method to implement cache. There are many other valid ways to do the same things, but they all lead to the same place.
Cache is simply a high-speed piece of memory that stores a snapshot of main memory, enabling higher processor performance.

[5] MESI stands for the four states a cache line can be in: Modified (M), Exclusive (E), Shared (S), and Invalid (I). For more detailed information about MESI, please refer to the Developer's Manual.
[6] This discussion assumes only Pentium(R) processor internal cache and no external cache. The concept still applies if an L2 cache is present.
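As a final concrete illustration of the organization in section 3.1, a 32-bit address can be split into tag, line index, and byte offset from the stated sizes: 32-byte lines give 5 offset bits, and 128 lines per way give 7 index bits. The decomposition below is inferred from those numbers, not quoted from the text:

```python
# Illustrative address split using the section 3.1 figures:
# 32-byte lines (5 offset bits), 128 lines per way (7 index bits,
# a 4K page), with the remaining high bits forming the tag.
OFFSET_BITS = 5    # 2**5 = 32-byte cache line
INDEX_BITS = 7     # 2**7 = 128 lines per way

def split(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# Addresses exactly 4K apart share a line index, so they compete for
# the same cache line; with 2 ways, two of them can be cached at once.
print(split(0x1234))   # (1, 17, 20)
print(split(0x2234))   # (2, 17, 20) -- same index, different tag
```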