Processor Applications


1 Processor Applications
General purpose - high performance: Alphas, SPARC, MIPS, ... Used for general-purpose software; heavyweight OS (UNIX, NT); workstations, PCs.
Embedded processors and processor cores: ARM, 486SX, Hitachi SH7000, NEC V800. Single program; lightweight, often real-time OS; DSP support. Cellular phones, consumer electronics (e.g. CD players).
Microcontrollers: extremely cost sensitive; small word size - 8 bit common; the highest-volume processors by far. Automobiles, toasters, thermostats, ...
(Cost increases toward the top of this list, volume toward the bottom.)

2 Processor Markets
[Market-size chart; figures shown: $30B, 32-bit micro, $5.2B/17%, $1.2B/4%, 32-bit DSP]

3 The Processor Design Space
[Design-space plot of performance vs. cost: microprocessors - performance is everything, software rules; embedded processors - application-specific architectures for performance; microcontrollers - cost is everything]

4 Market for DSP Products
[Chart: DSP, analog, and mixed-signal market segments.] DSP is the fastest-growing segment of the semiconductor market.

5 DSP Applications
- Audio: MPEG audio, portable audio
- Digital cameras
- Wireless: cellular telephones, base stations
- Networking: cable modems, ADSL, VDSL

6 Another Look at DSP Applications
High end: wireless base stations - TMS320C6000; cable modem gateways.
Mid range: cellular phones - TMS320C540; fax/voice servers.
Low end: storage products - TMS320C27; digital cameras - TMS320C5000; portable phones, wireless headsets, consumer audio.
(Cost increases toward the high end, volume toward the low end.)

7 DSP range of applications

8 DSP ARCHITECTURE Enabling Technologies

9 CELLULAR TELEPHONE SYSTEM
[Block diagram: handset keypad (digits 1-9), controller, RF modem, physical-layer processing, baseband converter, A/D, speech encode, speech decode, DAC]

10 HW/SW/IC PARTITIONING
[Same cellular-phone block diagram partitioned onto parts: controller -> microcontroller; physical-layer processing -> ASIC; speech encode/decode -> DSP; RF modem, baseband converter, A/D, DAC -> analog IC]

11 Mapping Onto A System-on-a-chip
[System-on-a-chip diagram: µC with phone book, keypad interface, protocol control, DMA, RAM, serial/parallel ports; DSP core with speech-quality enhancement and voice recognition; ASIC logic with de-interleaver & decoder, RPE-LTP speech decoder, demodulator and synchronizer, Viterbi equalizer]

12 Example Wireless Phone Organization
[Diagram: phone built around a TMS320C540 DSP and an ARM7 microcontroller]

13 Multimedia I/O Architecture
[Diagram: embedded processor and radio modem on a low-power bus, with scheduler, ECC, Pact interface, frame-buffer FIFOs, video decompression, SRAM, and pen, graphics, audio, and video data flows]

14 Multimedia System-on-a-Chip
E.g. multimedia terminal electronics. [Diagram: µP, DSP, communications unit, video unit, custom logic, and memory, with uplink/downlink radio, graphics out, video I/O, voice I/O, and pen input.] Future chips will be a mix of processors, memory, and dedicated hardware for specific algorithms and I/O.

15 Requirements of Embedded Processors
- Optimized for a single program - code often in on-chip ROM or off-chip EPROM
- Minimum code size (initially one of the motivations for Java)
- Performance obtained by optimizing the datapath
- Low cost: lowest possible area; technology behind the leading edge; high level of integration of peripherals (reduces system cost)
- Fast time to market: compatible architectures (e.g. ARM) allow reusable code; customizable cores
- Low power if the application requires portability

16 Area of processor cores = Cost
[Die-area comparison: Nintendo processor and cellular-phone processor cores]

17 Another figure of merit: Computation per unit area
[The same cores compared by computation per unit area]

18 Code size
If a majority of the chip is the program stored in ROM, then code size is a critical issue. The Piranha has three instruction sizes: a basic 2-byte form, and 2-byte forms with a 16- or 32-bit immediate.

19 DSP BENCHMARKS - DSPstone
ZIVOJNOVIC, VELARDE, SCHLÄGER: UNIVERSITY OF AACHEN
Application benchmark: ADPCM transcoder - CCITT G.721
Kernels: REAL_UPDATE, COMPLEX_UPDATES, DOT_PRODUCT, MATRIX_1X3, CONVOLUTION, FIR, FIR2DIM, IIR_ONE_BIQUAD, LMS, FFT_INPUT_SCALED

20 Evolution of GP and DSP
- General-purpose microprocessors trace their roots back to Eckert, Mauchly, and von Neumann (ENIAC)
- DSPs evolved from analog signal processors, which used analog hardware to transform physical signals (classical electrical engineering)
- ASP moved to DSP because DSP is insensitive to the environment (e.g., same response in snow or desert, if it works at all)
- DSP performance is identical even with variations in components; two analog systems behave differently even when built from the same components with 1% variation
- Different history and different applications led to different terms, different metrics, and some new inventions
- Convergence of markets will lead to an architectural showdown

21 Embedded Systems vs. General Purpose Computing - 1
Embedded systems:
- Run a few applications, often known at design time
- Not end-user programmable
- Operate under fixed run-time constraints; additional performance may not be useful/valuable
- Differentiating features: power, cost, speed (must be predictable)
General-purpose computing:
- Intended to run a fully general set of applications
- End-user programmable
- Faster is always better
- Differentiating features: speed (need not be fully predictable), cost (the largest component), power

22 DSP vs. General Purpose MPU
- DSPs tend to run one program, not many, so operating systems are much simpler: no virtual memory or protection, ...
- DSPs sometimes run hard real-time applications: you must account for anything that could happen in a time slot; all possible interrupts or exceptions must be accounted for and their collective time subtracted from the time interval. Therefore, exceptions are BAD.
- DSPs handle an infinite, continuous data stream

23 DSP vs. General Purpose MPU
- The "MIPS/MFLOPS" of DSPs is the speed of the multiply-accumulate (MAC); DSPs are judged by whether they can keep their multipliers busy 100% of the time.
- The "SPEC" of DSPs is 4 algorithms: infinite impulse response (IIR) filters, finite impulse response (FIR) filters, FFT, and convolvers.
- In DSPs, algorithms are important; binary compatibility is not an issue.
- High-level software is not (yet) important in DSPs: people still write in assembly language to minimize the die area for ROM on the DSP chip.

24 TYPES OF DSP PROCESSORS
- DSP multiprocessors on a die: TMS320C80, TMS320C6000
- 32-bit floating point: TI TMS320C4x, Motorola 96000, AT&T DSP32C, Analog Devices ADSP-21000
- 16-bit fixed point: TI TMS320C2x, Motorola 56000, AT&T DSP16, Analog Devices ADSP-2100

25 Architectural Features of DSPs
- Data path configured for DSP: fixed-point arithmetic; MAC (multiply-accumulate)
- Multiple memory banks and buses (Harvard architecture): multiple data memories
- Specialized addressing modes: bit-reversed addressing; circular buffers
- Specialized instruction set and execution control: zero-overhead loops; support for MAC
- Specialized peripherals for DSP

26 DSP Data Path: Arithmetic
- DSPs deal with numbers representing the real world => want "reals"/fractions
- DSPs deal with numbers for addresses => want integers
- So they support "fixed point" as well as integers:
    fractional format (sign bit S, radix point immediately after it): -1 ≤ x < 1
    integer format (sign bit S, radix point at the right): -2^(N-1) ≤ x < 2^(N-1)
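To make the fractional format concrete, here is a minimal C sketch (my own illustration, not from the lecture) of the common Q15 convention: a 16-bit word with the radix point right after the sign bit, so -1 ≤ x < 1.

    #include <stdint.h>
    #include <stdio.h>

    /* Q15: 16-bit signed fraction, -1 <= x < 1, stored as x * 2^15. */
    typedef int16_t q15_t;

    static q15_t  q15_from_double(double x) { return (q15_t)(x * 32768.0); }
    static double q15_to_double(q15_t x)    { return x / 32768.0; }

    /* Q15 * Q15 gives a Q30 product in 32 bits; shift right 15 to return to Q15. */
    static q15_t q15_mul(q15_t a, q15_t b)
    {
        int32_t p = (int32_t)a * (int32_t)b;   /* n-bit * n-bit -> 2n-bit product */
        return (q15_t)(p >> 15);
    }

    int main(void)
    {
        q15_t a = q15_from_double(0.5), b = q15_from_double(-0.25);
        printf("%f\n", q15_to_double(q15_mul(a, b)));   /* prints -0.125000 */
        return 0;
    }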

27 DSP Data Path: Precision
- Word size affects the precision of fixed-point numbers
- DSPs have 16-bit, 20-bit, or 24-bit data words
- Floating-point DSPs cost 2x-4x vs. fixed point and are slower than fixed point
- DSP programmers scale values inside the code: SW libraries; a separate explicit exponent; "blocked floating point" - a single exponent for a group of fractions
- Floating-point support simplifies development
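A minimal C sketch (my own, assuming Q15 mantissas) of the "blocked floating point" idea: one shared power-of-two exponent chosen from the largest magnitude in a group of fractions.

    #include <stdint.h>
    #include <stddef.h>
    #include <math.h>

    /* Block floating point: one shared exponent for a group of Q15 mantissas. */
    typedef struct {
        int16_t *mant;   /* Q15 mantissas */
        int      exp;    /* shared power-of-two exponent */
        size_t   n;
    } bfp_block_t;

    /* Encode: pick the exponent from the largest magnitude, then scale every
       element into the Q15 range [-1, 1). */
    static void bfp_encode(const double *x, size_t n, int16_t *mant, bfp_block_t *out)
    {
        double maxabs = 0.0;
        for (size_t i = 0; i < n; i++)
            if (fabs(x[i]) > maxabs) maxabs = fabs(x[i]);

        int e = (maxabs > 0.0) ? (int)floor(log2(maxabs)) + 1 : 0;
        for (size_t i = 0; i < n; i++)
            mant[i] = (int16_t)ldexp(x[i], 15 - e);   /* x / 2^e, in Q15 */

        out->mant = mant; out->exp = e; out->n = n;
    }

    /* Decode one element: mantissa * 2^(exp - 15). */
    static double bfp_decode(const bfp_block_t *b, size_t i)
    {
        return ldexp((double)b->mant[i], b->exp - 15);
    }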

28 DSP Data Path: Overflow
DSPs are descended from analog systems, so rather than the modulo arithmetic of ordinary two's-complement overflow, the result is set to the most positive value (2^(N-1) - 1) or the most negative value (-2^(N-1)): "saturation". Many algorithms were developed in this model.
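A small C sketch (mine, assuming 16-bit data) of saturating addition, contrasted with the wrap-around (modulo) behavior of plain two's-complement arithmetic:

    #include <stdint.h>

    /* Saturating 16-bit add: clamp to the most positive / most negative value
       instead of wrapping around (modulo arithmetic). */
    static int16_t sat_add16(int16_t a, int16_t b)
    {
        int32_t s = (int32_t)a + (int32_t)b;     /* wide intermediate, cannot overflow */
        if (s > INT16_MAX) return INT16_MAX;     /*  2^15 - 1 */
        if (s < INT16_MIN) return INT16_MIN;     /* -2^15     */
        return (int16_t)s;
    }

    /* Example: 30000 + 10000 wraps to -25536 in plain 16-bit arithmetic,
       but saturates to 32767 here. */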

29 DSP Data Path: Multiplier
- Specialized hardware performs all key arithmetic operations in 1 cycle
- 50% of instructions can involve the multiplier => single-cycle-latency multiplier
- Need to perform multiply-accumulate (MAC)
- n-bit multiplier => 2n-bit product

30 DSP Data Path: Accumulator
- Don't want to overflow the accumulator or have to scale it
- Option 1: accumulator wider than the product: "guard bits" (Motorola DSP: 24b x 24b => 48b product, 56b accumulator)
- Option 2: shift right and round the product before the adder
[Diagram: multiplier and shifter feeding the ALU and an accumulator with guard bits G]
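A sketch of option 1 (my own, modeled loosely on the 24b x 24b case and using int64_t to stand in for a 56-bit accumulator): the extra high-order guard bits let many products be summed before the accumulator itself can overflow.

    #include <stdint.h>
    #include <stddef.h>

    /* 24b x 24b -> 48b product, accumulated into a 56-bit accumulator
       (modeled here with int64_t). The 8 guard bits above the product allow
       up to 2^8 worst-case products to be summed without overflow. */
    static int64_t mac24(int64_t acc, int32_t x, int32_t y)
    {
        int64_t product = (int64_t)x * (int64_t)y;   /* fits in 48 bits */
        return acc + product;
    }

    static int64_t dot24(const int32_t *x, const int32_t *y, size_t n)
    {
        int64_t acc = 0;                  /* guard bits: no per-tap scaling needed */
        for (size_t i = 0; i < n; i++)
            acc = mac24(acc, x[i], y[i]);
        return acc;                       /* round/saturate once, when storing */
    }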

31 DSP Data Path: Rounding
- Even with guard bits, you need to round when storing the accumulator to memory
- 3 standard DSP options:
    Truncation: chop the result => biases results downward (toward negative infinity)
    Round to nearest: < 1/2 rounds down, ≥ 1/2 rounds up (more positive) => smaller bias
    Convergent rounding: < 1/2 rounds down, > 1/2 rounds up (more positive), exactly 1/2 rounds to make the lsb zero (+1 if it is 1, +0 if 0) => no bias; IEEE 754 calls this round to nearest even
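A C sketch (my own, assuming a value with headroom, e.g. a guarded Q30 MAC result being narrowed to Q15 by dropping 15 low-order bits, and an arithmetic right shift) comparing the three options:

    #include <stdint.h>

    #define SHIFT 15
    #define HALF  (1 << (SHIFT - 1))          /* bit pattern meaning exactly 1/2 lsb */

    static int32_t round_truncate(int32_t x)  /* chop: biased toward -infinity */
    {
        return x >> SHIFT;
    }

    static int32_t round_nearest(int32_t x)   /* ties round up: small positive bias */
    {
        return (x + HALF) >> SHIFT;
    }

    static int32_t round_convergent(int32_t x)  /* ties round to even lsb: no bias */
    {
        int32_t up   = (x + HALF) >> SHIFT;
        int32_t frac = x & ((1 << SHIFT) - 1);
        if (frac == HALF && (up & 1))           /* exactly 1/2 and lsb ended up odd */
            up -= 1;                            /* force the lsb to zero */
        return up;
    }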

32 Data Path Comparison
General-purpose processor:
- Multiplies often take > 1 cycle
- Shifts often take > 1 cycle
- Other operations (e.g., saturation, rounding) typically take multiple cycles
DSP processor:
- Specialized hardware performs all key arithmetic operations in 1 cycle
- Hardware support for managing numeric fidelity: shifters, guard bits, saturation

33 320C54x DSP Functional Block Diagram

34 DSP Algorithm Format
DSP culture has a graphical format to represent formulas - like a flowchart for formulas and inner loops, not programs. Some symbols seem natural: an adder symbol for add, X for multiply. Others are obtuse: z^-1 means take the variable from an earlier iteration. These graphs are trivial to decode.

35 DSP Algorithm Notation
- Uses "flowchart" notation instead of equations
- Multiply is drawn as a multiplier symbol or X
- Add is drawn as an adder symbol or +
- Delay/storage is drawn as a box labeled Delay, z^-1, or D

36 FIR Filtering: A Motivating Problem
- M most recent samples are kept in the delay line (Xi)
- Each new sample moves data down the delay line
- A "tap" is a multiply-add
- Each tap (M+1 taps total) nominally requires: two data fetches, a multiply, an accumulate, and a memory write-back to update the delay line
- Goal: 1 FIR tap per DSP instruction cycle
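A plain C version of one output sample of an (M+1)-tap FIR filter (my own sketch): each tap does the two data fetches, a multiply, an accumulate, and the delay-line write-back described above.

    #include <stdint.h>
    #include <stddef.h>

    /* One output sample of an (M+1)-tap FIR filter.
       x[0..M] is the delay line; after the call, x[i] holds sample x(n-i). */
    static int32_t fir_step(int16_t *x, const int16_t *h, size_t M, int16_t in)
    {
        int32_t acc = 0;

        for (size_t i = M; i > 0; i--) {
            x[i]  = x[i - 1];                        /* shift the delay line */
            acc  += (int32_t)h[i] * (int32_t)x[i];   /* tap i: two fetches + MAC */
        }
        x[0] = in;                                   /* newest sample x(n) */
        acc += (int32_t)h[0] * (int32_t)x[0];        /* tap 0 */
        return acc;
    }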

37 FINITE-IMPULSE RESPONSE (FIR) FILTER

38 FIR filter on (simple) General Purpose Processor
loop: lw   x0, 0(r0)
      lw   y0, 0(r1)
      mul  a, x0, y0
      add  y0, a, b
      sw   y0, (r2)
      inc  r0
      inc  r1
      inc  r2
      dec  ctr
      tst  ctr
      jnz  loop
Problems: bus/memory bandwidth bottleneck, control code overhead

39 First Generation DSP (1982): Texas Instruments TMS32010
- 16-bit fixed point
- "Harvard architecture": separate instruction and data memories
- Accumulator
- Specialized instruction set: load and accumulate
- 390 ns multiply-accumulate (MAC) time; 228 ns today
[Diagram: instruction memory and data memory feeding the processor; datapath: memory -> T register -> multiplier -> P register -> ALU -> accumulator]

40 TMS32010 FIR Filter Code
Here X4, H4, ... are direct (absolute) memory addresses:

    LT  X4   ; Load T with x(n-4)
    MPY H4   ; P = H4*X4
    LTD X3   ; Load T with x(n-3); x(n-4) = x(n-3); Acc = Acc + P
    MPY H3   ; P = H3*X3
    LTD X2
    MPY H2
    ...

Two instructions per tap, but requires unrolling

41 Micro-architectural impact - MAC
One element of the finite-impulse response filter computation. [Diagram: X and Y operands feed the multiplier (MPY); the product goes through ADD/SUB into the accumulator register (ACC REG).]

42 Mapping of the filter onto a DSP execution unit
[Diagram: a first-order filter section (roughly Yn = b·Xn + a·Yn-1) mapped onto multiplier, adder, and delay (D) elements, with numbered operand transfers.] The critical hardware unit in a DSP is the multiplier - much of the architecture is organized around allowing use of the multiplier on every cycle. This means providing two operands on every cycle, through multiple data and address buses, multiple address units, and local accumulator feedback.

43 MAC Eg. - 320C54x DSP Functional Block Diagram

44 DSP Memory
- An FIR tap implies multiple memory accesses => DSPs want multiple data ports
- Some DSPs use ad hoc techniques to reduce memory bandwidth demand
- Instruction repeat buffer: execute 1 instruction 256 times; often disables interrupts, thereby increasing interrupt response time
- Some recent DSPs have instruction caches; even then they may allow the programmer to "lock" instructions into the cache, or offer an option to turn the cache into fast program memory
- No DSPs have data caches, but they may have multiple data memories

45 Conventional "von Neumann" memory

46 HARVARD MEMORY ARCHITECTURE in DSP
[Diagram: program memory plus separate X and Y data memories, accessed over P, X, and Y data buses, with a global bus]

47 Memory Architecture Comparison
General-purpose processor: Von Neumann architecture; typically 1 access/cycle; uses caches.
DSP processor: Harvard architecture; 2-4 memory accesses/cycle; no caches - on-chip SRAM with separate program and data memories.
[Diagrams: processor with a single memory vs. processor with separate program and data memories]

48 Eg. TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture

49 Eg. 320C62x/67x DSP

50 DSP Addressing
- Standard addressing modes: immediate, displacement, register indirect
- Want to keep the MAC datapath busy
- Assumption: any extra instruction implies clock cycles of overhead in the inner loop => complex addressing is good => don't use the datapath to calculate fancy addresses
- Autoincrement/autodecrement register indirect: lw r1, 0(r2)+ => r1 <- M[r2]; r2 <- r2 + 1
- Option to increment before or after the access, by a positive or negative amount
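For reference, the same post-increment pattern written out in C (my own illustration); a DSP address generation unit performs this pointer update in parallel, without spending datapath cycles on it.

    #include <stdint.h>

    /* Post-increment register-indirect access, as in  lw r1, 0(r2)+  :
       read through the pointer, then bump the pointer by one element. */
    static int16_t load_postinc(const int16_t **p)
    {
        int16_t value = **p;   /* r1 <- M[r2] */
        (*p)++;                /* r2 <- r2 + 1 (one element) */
        return value;
    }

    /* Pre-decrement variant: adjust the pointer first, then access. */
    static int16_t load_predec(const int16_t **p)
    {
        (*p)--;
        return **p;
    }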

51 DSP Addressing: FFT
FFTs start or end with data in butterfly (bit-reversed) order:
0 (000) => 0 (000)
1 (001) => 4 (100)
2 (010) => 2 (010)
3 (011) => 6 (110)
4 (100) => 1 (001)
5 (101) => 5 (101)
6 (110) => 3 (011)
7 (111) => 7 (111)
What can be done to avoid the overhead of address-calculation instructions for the FFT? Have an optional "bit reverse" addressing mode for use with autoincrement addressing. Many DSPs have "bit reverse" addressing for the radix-2 FFT.
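A small C sketch (mine) of what a hardware bit-reverse addressing mode computes, for an FFT of length 2^bits:

    #include <stdint.h>
    #include <stddef.h>

    /* Reverse the low 'bits' bits of index i (e.g. bits = 3: 1 (001) -> 4 (100)). */
    static unsigned bit_reverse(unsigned i, unsigned bits)
    {
        unsigned r = 0;
        for (unsigned b = 0; b < bits; b++) {
            r = (r << 1) | (i & 1);
            i >>= 1;
        }
        return r;
    }

    /* Permute an array of 2^bits samples into bit-reversed order, in place. */
    static void bit_reverse_permute(int16_t *x, unsigned bits)
    {
        size_t n = (size_t)1 << bits;
        for (size_t i = 0; i < n; i++) {
            size_t j = bit_reverse((unsigned)i, bits);
            if (j > i) { int16_t t = x[i]; x[i] = x[j]; x[j] = t; }
        }
    }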

52 BIT REVERSED ADDRESSING
Data flow in the radix-2 decimation-in-time FFT algorithm.

53 DSP Addressing: Buffers
- DSPs deal with continuous I/O, often through I/O buffers (delay lines)
- To save memory, buffers are often organized as circular buffers
- What can be done to avoid the overhead of address-checking instructions for a circular buffer?
- Option 1: keep a start register and an end register per address register for use with autoincrement addressing; reset to the start when the end of the buffer is reached
- Option 2: keep a buffer-length register, assuming the buffer starts on an aligned address; reset to the start when the end is reached
- Every DSP has "modulo" or "circular" addressing
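A minimal C sketch (my own) of option 2 in its cheapest form: modulo addressing with an aligned, power-of-two buffer length, so the wrap-around is a single mask rather than a compare-and-reset.

    #include <stdint.h>
    #include <stddef.h>

    /* Circular (modulo) addressing: length must be a power of two so the
       wrap-around is a single AND. */
    typedef struct {
        int16_t *base;
        size_t   mask;    /* length - 1 */
        size_t   index;   /* next write position */
    } circ_buf_t;

    static void circ_push(circ_buf_t *b, int16_t sample)
    {
        b->base[b->index] = sample;
        b->index = (b->index + 1) & b->mask;   /* the hardware does this for free */
    }

    /* Read the sample written 'delay' steps ago (a delay-line element). */
    static int16_t circ_tap(const circ_buf_t *b, size_t delay)
    {
        return b->base[(b->index - 1 - delay) & b->mask];
    }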

54 CIRCULAR BUFFERS
Instructions accommodate three elements: buffer address, buffer size, increment.
Allows cycling through delay elements and coefficients in data memory.

55 Addressing Comparison
DSP processor:
- Dedicated address generation units
- Specialized addressing modes, e.g.: autoincrement, modulo (circular), bit-reversed (for FFT)
- Good immediate data support
General-purpose processor:
- Often no separate address generation unit
- General-purpose addressing modes

56 Address calculation unit for DSPs
Supports modulo and bit-reversal arithmetic. Often duplicated to calculate multiple addresses per cycle.

57 DSP Instructions and Execution
- May specify multiple operations in a single instruction
- Must support multiply-accumulate (MAC)
- Need parallel move support
- Usually have special loop support to reduce branch overhead: loop over an instruction or a sequence; a 0 value in the loop-count register usually means loop the maximum number of times, so if the loop count is computed, be sure that 0 does not mean 0
- May have saturating shift-left arithmetic
- May have conditional execution to reduce branches

58 ADSP 2100: ZERO-OVERHEAD LOOP
"DO <addr> UNTIL condition"
[Diagram: the loop runs from the DO instruction down to address X.] In the address-generation hardware:
    PCS = PC + 1                  (saved when DO executes)
    if (PC == X && !condition)
        PC = PCS
    else
        PC = PC + 1
Eliminates a few instructions in loops - important in loops with small bodies.

59 Instruction Set Comparison
DSP processor: specialized, complex instructions; multiple operations per instruction:
    mac  x0,y0,a  x:(r0)+,x0  y:(r4)+,y0
General-purpose processor: general-purpose instructions; typically only one operation per instruction:
    mov  *r0, x
    mov  *r1, y
    mpy  x0, y0, a
    add  a, b
    mov  y0, *r2
    inc  r0
    inc  r1
    inc  r2

60 Specialized Peripherals for DSPs
Synchronous serial ports, parallel ports, timers, on-chip A/D and D/A converters, host ports, bit I/O ports, on-chip DMA controller, clock generators.
On-chip peripherals are often designed for "background" operation, even when the core is powered down.

61 Specialized DSP peripherals

62 TMS320C203/LC203 BLOCK DIAGRAM DSP Core Approach - 1995

63 Summary of Architectural Features of DSPs
- Data path configured for DSP: fixed-point arithmetic; MAC (multiply-accumulate)
- Multiple memory banks and buses (Harvard architecture): multiple data memories
- Specialized addressing modes: bit-reversed addressing; circular buffers
- Specialized instruction set and execution control: zero-overhead loops; support for MAC
- Specialized peripherals for DSP
THE ULTIMATE IN BENCHMARK-DRIVEN ARCHITECTURE DESIGN.

64 Texas Instruments TMS320 Family: Multiple DSP µP Generations

65 First Generation DSP µP Case Study: TMS32010 (Texas Instruments) - 1982
Features:
- 200 ns instruction cycle (5 MIPS)
- 144 words (16-bit) of on-chip data RAM
- 1.5K words (16-bit) of on-chip program ROM (TMS32010)
- External program memory expansion to a total of 4K words at full speed
- 16-bit instruction/data word
- Single-cycle 32-bit ALU/accumulator
- Single-cycle 16 x 16-bit multiply in 200 ns
- Two-cycle MAC (5 MOPS)
- Zero- to 15-bit barrel shifter
- Eight input and eight output channels

66 TMS32010 BLOCK DIAGRAM

67 Third Generation DSP µP Case Study: TMS320C30 - 1988
TMS320C30 key features:
- 60 ns single-cycle instruction execution time: 33.3 MFLOPS (million floating-point operations per second), 16.7 MIPS (million instructions per second)
- One 4K x 32-bit single-cycle dual-access on-chip ROM block
- Two 1K x 32-bit single-cycle dual-access on-chip RAM blocks
- 64 x 32-bit instruction cache
- 32-bit instruction and data words, 24-bit addresses
- 40/32-bit floating-point/integer multiplier and ALU
- 32-bit barrel shifter
- Eight extended-precision registers (accumulators)
- Two address generators with eight auxiliary registers and two auxiliary register arithmetic units
- On-chip direct memory access (DMA) controller for concurrent I/O and CPU operation
- Parallel ALU and multiplier instructions
- Block repeat capability
- Interlocked instructions for multiprocessing support
- Two serial ports supporting 8/16/32-bit transfers
- Two 32-bit timers
- 1 µm CMOS process

68 TMS320C30 BLOCK DIAGRAM

69 TMS320C3x CPU BLOCK DIAGRAM

70 TMS320C3x MEMORY BLOCK DIAGRAM

71 TMS320C30 FIR FILTER PROGRAM
y(n) = x[n-(N-1)]·h(N-1) + x[n-(N-2)]·h(N-2) + ... + x(n)·h(0)
For N = 50, t = 3.6 µs (277 kHz).

72 Texas Instruments TMS320C80 MIMD MULTIPROCESSOR DSP (1996)

73 16 bit Fixed Point VLIW DSP: TMS320C6201 Revision 2 (1997)
[Block diagram: C6201 CPU megamodule - program fetch, instruction dispatch, instruction decode; data path 1 (L1, S1, M1, D1, A register file) and data path 2 (L2, S2, M2, D2, B register file); control registers, control logic, interrupts, emulation, test, power-down; program cache/program memory (32-bit address, 256-bit data, 512K bits RAM); data memory (32-bit address, 8-/16-/32-bit data, 512K bits RAM); external memory interface, 4-channel DMA, host port interface, 2 timers, 2 multi-channel buffered serial ports (T1/E1)]

74 C6201 Internal Memory Architecture
- Separate internal program and data spaces
- Program: 16K 32-bit instructions (2K fetch packets); 256-bit fetch width; configurable as either a direct-mapped cache or memory-mapped program memory
- Data: 32K x 16, single-ported, accessible by both CPU data buses; 4 x 8K 16-bit banks; 2 possible simultaneous memory accesses (4 banks); 4-way interleave - banks and interleaving minimize access conflicts

75 C62x Datapaths
[Datapath diagram: register files A0-A15 and B0-B15; units L1, S1, M1, D1 and L2, S2, M2, D2; cross paths between the register files; data address buses DADR1/DADR2; load-data buses DDATA_I1/DDATA_I2 and store-data buses DDATA_O1/DDATA_O2; 40-bit read/store paths, with 40-bit write paths carrying the 8 MSBs]

76 C62x Functional Units
- L-unit (L1, L2): 40-bit integer ALU, comparisons, bit counting, normalization
- S-unit (S1, S2): 32-bit ALU, 40-bit shifter, bitfield operations, branching
- M-unit (M1, M2): 16 x 16 -> 32-bit multiply
- D-unit (D1, D2): 32-bit add/subtract, address calculations

77 C62x Instruction Packing - Advanced VLIW
- Fetch packet: the CPU fetches 8 instructions/cycle
- Execute packet: the CPU executes 1 to 8 instructions/cycle
- Fetch packets can contain multiple execute packets
- Parallelism is determined at compile/assembly time
- Examples: (1) 8 parallel instructions A-H; (2) 8 serial instructions; (3) mixed serial/parallel groups, e.g. A // B, C, D, E // F // G // H
- Reduces code size, the number of program fetches, and power consumption

78 C62x Pipeline Operation Pipeline Phases
Pipeline phases:
- Fetch: PG (program address generate), PS (program address send), PW (program access ready wait), PR (program fetch packet receive)
- Decode: DP (instruction dispatch), DC (instruction decode)
- Execute: E1 - E5 (execute 1 through execute 5)
Single-cycle throughput; all phases operate in lock step.
[Diagram: execute packets 1-7 advancing one phase per cycle through PG PS PW PR DP DC E1-E5]

79 C62x Pipeline Operation Delay Slots
Delay slots: the number of extra cycles until a result is written to the register file and is available for use by a subsequent instruction. A multi-cycle NOP instruction can fill delay slots while minimizing the code-size impact.
- Most instructions: complete in E1, no delay slots
- Integer multiply: E1-E2, 1 delay slot
- Loads: E1-E5, 4 delay slots
- Branches: E1, with the branch target re-entering at PG: 5 delay slots

80 C6000 Instruction Set Features Conditional Instructions
All instructions can be conditional. Registers A1, A2, B0, B1, and B2 can be used as conditions, based on a zero or non-zero value; compare instructions allow other conditions (<, >, etc.). Reduces branching and increases parallelism.

81 C6000 Instruction Set Addressing Features
Load-store architecture with two addressing units (D1, D2). Orthogonal: any register can be used for addressing or indexing. Signed/unsigned byte, half-word, word, and double-word addressable; indexes are scaled by type. Index is a register or a 5-bit unsigned constant.

82 C6000 Instruction Set Addressing Features
Indirect addressing modes (a C analogue is sketched below):
    pre-increment *++R[index], post-increment *R++[index], pre-decrement *--R[index], post-decrement *R--[index], positive offset *+R[index], negative offset *-R[index]
15-bit positive/negative constant offset from either B14 or B15.
Circular addressing: fast and low cost - power-of-2 sizes and alignment; up to 8 different pointers/buffers, up to 2 different buffer sizes.
Dual endian support.
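For orientation only, a C sketch (mine, not TI syntax or generated code) of pointer expressions that correspond roughly to these indirect modes; the real hardware scales the index by the access size automatically.

    #include <stdint.h>

    /* Rough C analogues of the C6000 indirect addressing modes (illustrative only). */
    static void addressing_modes(const int16_t *buf, int idx)
    {
        const int16_t *p = buf + 16;   /* "address register" pointing into buf */
        int16_t v;

        v = *(p += idx);               /* pre-increment   *++R[idx] */
        v = *p; p += idx;              /* post-increment  *R++[idx] */
        v = *(p -= idx);               /* pre-decrement   *--R[idx] */
        v = *p; p -= idx;              /* post-decrement  *R--[idx] */
        v = *(p + idx);                /* positive offset *+R[idx]  */
        v = *(p - idx);                /* negative offset *-R[idx]  */
        (void)v;
    }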

83 32 Bit Floating Point VLIW DSP: TMS320C6701 (1997)
[Block diagram: ’C67x floating-point CPU core - program fetch, instruction dispatch, instruction decode; data path 1 (L1, S1, M1, D1, A register file) and data path 2 (L2, S2, M2, D2, B register file); control registers, control logic, interrupts, emulation, test, power-down; program cache/program memory (32-bit address, 256-bit data, 512K bits RAM); data memory (32-bit address, 8-/16-/32-bit data); external memory interface, 4-channel DMA, host port interface, 2 timers, 2 multi-channel buffered serial ports (T1/E1)]

84 TMS320C6701 Advanced VLIW CPU (VelociTI™)
- 167 MHz, 6 ns cycle time
- 6 x 32-bit floating-point instructions/cycle
- Load-store architecture
- 3.3-V I/Os, 1.8-V internal
- Single- and double-precision IEEE floating point
- Dual data paths; 6 floating-point units; 8 x 32-bit instructions
- External interface supports SDRAM, SRAM, SBSRAM
- 4-channel bootloading DMA
- 16-bit host port interface
- 1 Mbit on-chip SRAM
- 2 multichannel buffered serial ports (T1/E1)
- Pin compatible with the ’C6201

85 TMS320C67x CPU Core
[Block diagram: ’C67x floating-point CPU core - program fetch, instruction dispatch, instruction decode, control registers, control logic, test, emulation, interrupts; data path 1 (L1, S1, M1, D1, A register file) and data path 2 (L2, S2, M2, D2, B register file). Floating-point capability is added to the arithmetic logic unit, the auxiliary logic unit, and the multiplier unit.]

86 C67x New Instructions
- .L unit (floating-point arithmetic unit): ADDSP, ADDDP, SUBSP, SUBDP, INTSP, INTDP, SPINT, DPINT, SPTRUNC, DPTRUNC, DPSP
- .M unit (floating-point multiply unit): MPYSP, MPYDP, MPYI, MPYID, MPY24, MPY24H
- .S unit (floating-point auxiliary unit): ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP

87 C67x Datapaths
- L-unit (L1, L2): floating-point and 40-bit integer ALU; bit counting, normalization
- S-unit (S1, S2): floating-point auxiliary unit; 32-bit ALU / 40-bit shifter; bitfield operations, branching
- M-unit (M1, M2): multiplier, integer and floating point
- D-unit (D1, D2): 32-bit add/subtract, address calculations
- 2 data paths, 8 functional units, orthogonal/independent: 2 floating-point multipliers, 2 floating-point arithmetic units, 2 floating-point auxiliary units; independently controlled; up to 8 32-bit instructions
- Registers: 2 files, 32 x 32-bit registers total; cross paths (1X, 2X)
[Datapath diagram: register files A0-A15 and B0-B15 with units L1, S1, M1, D1 / L2, S2, M2, D2 and the 1X/2X cross paths]

88 C67x Instruction Packing - Enhanced VLIW
- Fetch packet: the CPU fetches 8 instructions/cycle
- Execute packet: the CPU executes 1 to 8 instructions/cycle
- Fetch packets can contain multiple execute packets
- Parallelism is determined at compile/assembly time
- Examples: (1) 8 parallel instructions A-H; (2) 8 serial instructions; (3) mixed serial/parallel groups, e.g. A // B, C, D, E // F // G // H
- Reduces code size, the number of program fetches, and power consumption

89 C67x Pipeline Operation: Pipeline Phases
Pipeline phases:
- Fetch: PG (program address generate), PS (program address send), PW (program access ready wait), PR (program fetch packet receive)
- Decode: DP (instruction dispatch), DC (instruction decode)
- Execute: E1 - E5 (execute 1 through execute 5); E6 - E10 used by double-precision operations only
All phases operate in lock step.
[Diagram: execute packets 1-7 advancing one phase per cycle through PG PS PW PR DP DC E1-E10]

90 C67x Pipeline Operation Delay Slots
Delay slots: the number of extra cycles until a result is written to the register file and is available for use by a subsequent instruction. A multi-cycle NOP instruction can fill delay slots while minimizing the code-size impact.
- Most integer instructions: E1, no delay slots
- Single-precision floating point: E1-E4, 3 delay slots
- Loads: E1-E5, 4 delay slots
- Branches: E1, with the branch target re-entering at PG: 5 delay slots

91 ’C67x and ’C62x Commonality
Driving commonality between the ’C67x and ’C62x shortens the ’C67x design time, as does maintaining symmetry between the datapaths.
[Side-by-side diagrams: the ’C62x and ’C67x CPUs share the same structure - program fetch & dispatch, decode, two register files, and L/S/M/D units in each datapath, plus control registers and emulation - with the ’C67x L, S, and M units extended with floating point.]

