- Pipeline Performance Metrics
- Synchronous vs. Asynchronous
- Forward and Backward Latencies
- Linear Pipeline Performance
- Cycle Time, Latency, Throughput, Dynamic Slack/Occupancy
- Occupancy vs. # of Tokens Graph
- Computing Peak Throughput and Average Token Occupancy
- Performance of Pipeline Rings
- An Example: GCD Implementation
- Pipelining the GCD datapath
- Optimising Pipelines and Rings
- Synchronous Design
- Latency of a pipeline
- Measured in terms of # of clock cycles
- Throughput
- Typically measured in terms of results per second
- Inverse of Clock Cycle Time for systems that generate a result each cycle
- Asynchronous Pipelined Design (i.e., using pipelined handshaking)
- Latency
- Time between tokens consumed at inputs and generated outputs
- Input tokens spread apart to avoid congestion slowing down results
- Cycle Time
- Taken as long-term average of time between successive output tokens
- Throughput
- Results (tokens) per second - inverse of cycle time
- Data-dependent delays
- Block-level delays and data-flow may be data-dependent
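To make the contrast concrete, here is a small worked comparison; the 2 ns and 2.5 ns figures are illustrative assumptions, not from the source:

```latex
% Synchronous: one result per clock cycle.
\Theta_{\mathrm{sync}} = \frac{1}{T_{\mathrm{clk}}} = \frac{1}{2\,\mathrm{ns}} = 500\ \mathrm{M\ results/s}
% Asynchronous: inverse of the long-term average spacing of output tokens.
\Theta_{\mathrm{async}} = \frac{1}{T_{\mathrm{avg}}} = \frac{1}{2.5\,\mathrm{ns}} = 400\ \mathrm{M\ results/s}
```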
- Forward latency (FL) – block level
- Time between tokens consumed at inputs and generated outputs
- Input tokens spread apart to avoid congestion slowing down results
- May be data-dependent
- (Forward) latency – system
- A sum of forward latencies through blocks
- Must account for causality of output tokens within blocks
- Either the earliest- or the latest-arriving input token may trigger the output token
- I.e., notion of critical path exists
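As a sketch of the system-level rule above (my notation, assuming each block waits for its latest-arriving input; a block that fires on its earliest input would need a min instead):

```latex
% System forward latency: largest sum of block forward latencies
% over any input-to-output path, i.e. the critical path.
L_{\mathrm{sys}} \;=\; \max_{\text{paths } p} \; \sum_{i \in p} FL_i
```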
- Local cycle time
- Shortest time for a stage to complete a handshake with its neighbours
- Cycle may involve three neighbours for half-buffers
- Lower bound on the achievable system cycle time
- Equals FL + BL
- Backward latency (BL)
- Time needed to reset before accepting new tokens
- Time between a generated output and the earliest time of the subsequent input
- [Figure: timing diagram annotating Forward Latency (FL), Backward Latency (BL), and Local Cycle Time]
- Cycle Time
- T = FL + BL (simplest case)
- Latency
- input to output delay = N * FL
- Throughput
- # of tokens flowing per unit time, generally = 1/T = 1/(FL + BL)
- Depends on throughput of sender/receiver
- Static Slack or Static Token Occupancy
- Maximum token capacity of the pipeline
- Spread
- Distance (in stages) between successive tokens at peak throughput
- Stages a token travels in one cycle time: Spread = (FL + BL)/FL
- Dynamic Slack or Dynamic Occupancy
- Average number of tokens in the pipeline during steady-state operation
- N / Spread = (N * FL)/(FL + BL)
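A worked example of these formulas, with assumed values N = 4 buffers, FL = 2 and BL = 6 time units:

```latex
\begin{align*}
T &= FL + BL = 2 + 6 = 8 \\
\text{Latency} &= N \cdot FL = 4 \cdot 2 = 8 \\
\Theta &= 1/T = 0.125 \ \text{tokens per time unit} \\
\text{Spread} &= (FL + BL)/FL = 4 \ \text{stages} \\
\text{Dynamic occupancy} &= N \cdot FL/(FL + BL) = 8/8 = 1 \ \text{token}
\end{align*}
```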
- Dynamic Slack or Dynamic Occupancy formula for N buffers: (N * FL)/(FL + BL)
- Assumptions:
- Tokens not stalled by buffers or BitBucket resetting
- Tokens inserted at rate of local cycle time (FL + BL)
- Tokens consumed at rate of local cycle time
- Can be as small as N/9 (depending on circuit type)
- Throughput is zero
- When no tokens in pipe
- When pipeline is full of tokens
- Peak throughput in-between
- Token-limited region
- Faster BitGen improves throughput
- Bubble-limited region
- Faster BitBucket improves performance
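The shape of this graph can be captured in a small formula; a sketch for a uniform N-stage full-buffer pipeline holding k tokens (my formulation; it reproduces the two slopes and the peak named on a later slide):

```latex
\begin{align*}
\Theta(k) &= \min\!\left( \frac{k}{N \cdot FL},\; \frac{N - k}{N \cdot BL} \right)
  && \text{token-limited vs. bubble-limited} \\
\Theta_{\max} &= \frac{1}{FL + BL}
  \quad \text{at} \quad k^{*} = \frac{N \cdot FL}{FL + BL}
  && \text{peak at the dynamic occupancy}
\end{align*}
```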
- Peak throughput is limited by:
- Worst-case (largest) local cycle time
- Dynamic slack/Occupancy
- Becomes a range of # of tokens
- Data-limited slope
- Inverse of the sum of FLs
- Bubble-limited slope
- (Negative) inverse of the sum of BLs
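Combining these three limits for non-uniform stages gives a hedged generalization of the uniform-stage formula sketched earlier (C denotes the static slack; FL_i and BL_i are per-stage latencies):

```latex
\Theta(k) \;=\; \min\!\left(
  \frac{k}{\sum_i FL_i},\;
  \frac{1}{\max_i \,(FL_i + BL_i)},\;
  \frac{C - k}{\sum_i BL_i}
\right)
```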
- function gcd(A, B)
- while A ≠ B
- if A > B
- A := A - B
- else
- B := B - A
- return O = A
- Pipeline ring: definition (informal)
- Pipeline buffers configured in a loop
- Can be combined with forks, joins
- Used in implementing iterative algorithms
- Each iteration implemented by a token traversing the loop
- Multiple tokens in loop possible
- Each token independent of others
- Implements “multi-threading”, i.e. pipelining
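A behavioral reference model of the gcd loop above, sketched in SystemVerilog for simulation only (module and function names are illustrative; like the pseudocode, it assumes both inputs are positive, since gcd(x, 0) would never terminate):

```verilog
// Software-style reference model of the GCD algorithm; this is NOT
// the pipelined datapath, only a golden model to check it against.
module gcd_ref;
  function automatic [7:0] gcd(input [7:0] A, input [7:0] B);
    while (A != B)
      if (A > B) A = A - B;  // A := A - B
      else       B = B - A;  // B := B - A
    return A;                // O = A
  endfunction
  initial $display("gcd(36, 24) = %0d", gcd(36, 24));  // prints 12
endmodule
```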
- Implementation Notes
- MUXs are same as MERGEs but consume both input tokens
- TB is a token buffer
- Generates a token on initialization with configurable value
- Acts as a buffer afterwards
- FORK cells implied by branching channels (omitted from the diagram for clarity)
- All cells use pipeline handshaking
- Architectural Feature
- Contains many pipeline rings
- Single Token around the ring
- Does one GCD at a time!
- Operation
- TB generates a token that lets new input tokens be accepted on PIs A and B
- A and B tested for equality, which controls how they are routed
- If A != B, routed to the SUBs & '<'
- Otherwise, A is routed to the output
- SUBs concurrently generate both differences
- The specific difference is routed back to the MERGEs, controlled by '<' via the MUXs
- Initial block
- Mechanism to send out initial token
- Init
- Value of initial token sent out
- Configurable via Verilog parameter feature
- Runs once at start-up and is never used again
- Always block
- Performs steady-state behavior
- Just like BUF cell
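A hedged SystemVerilog sketch of a TB cell along these lines. The Channel interface with blocking Send/Receive tasks, the port names L/R, and the parameter names are assumptions for illustration (a simplified two-phase model standing in for the course's channel library, not its actual API):

```verilog
// Minimal illustrative channel with blocking Send/Receive tasks
// (an assumption, not the actual library).
interface Channel #(parameter WIDTH = 8);
  logic [WIDTH-1:0] data;
  bit req = 0, ack = 0;
  task Send(input [WIDTH-1:0] d);
    data = d;
    req = !req;          // signal a new token
    wait (ack == req);   // block until it is consumed
  endtask
  task Receive(output [WIDTH-1:0] d);
    wait (req != ack);   // block until a token is present
    d = data;
    ack = req;           // acknowledge consumption
  endtask
endinterface

// Token buffer (TB): emits one initial token, then acts as a BUF.
module tb #(parameter WIDTH = 8,
            parameter logic [WIDTH-1:0] INIT = '0)  // configurable initial value
           (Channel L,   // left (input) channel
            Channel R);  // right (output) channel
  logic [WIDTH-1:0] x;

  // Initial block: send out the initial token; never used again.
  initial R.Send(INIT);

  // Always block: steady-state behavior, just like a BUF cell.
  always begin
    L.Receive(x);  // consume a token from the left neighbour
    R.Send(x);     // pass it to the right neighbour
  end
endmodule
```

A BUF cell would be the same module without the initial block.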
- Add another token buffer (TB)
- Pipelines the design by enabling two sets of tokens in the loop simultaneously
- Second set enters immediately, before the first set completes the algorithm
- Each set represents a thread and moves around the loop independently of the other set
- No interference, due to handshaking
- The Purpose
- Multiple instances of the GCD algorithm solved simultaneously
- Can improve throughput
- Completion may be Out-Of-Order (OOO), depending on # of iterations per input
- Buffer Performance
- Ring Architecture
- Three tokens with four full-buffers
- N.B. # of tokens in ring is constant
- Performance Analysis
- Each token can move forward every 3 * BL = 12 time units
- Example: token #3 moves forward at t = 2 and t = 14
- Each token completes an iteration every 12 * 4 = 48 time units
- Three tokens do this concurrently
- Completing 3 iterations every 48 time units
- Yields a throughput of 1 iteration every 48/3 = 16 time units
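These numbers agree with the bubble-limited ring bound from the canopy-graph sketch earlier (N = 4 buffers, k = 3 tokens, and BL = 4 time units, as implied by 3 * BL = 12):

```latex
\Theta_{\mathrm{ring}} \;=\; \frac{N - k}{N \cdot BL}
  \;=\; \frac{4 - 3}{4 \cdot 4}
  \;=\; \frac{1}{16}\ \text{iterations per time unit}
```

That is, one iteration every 16 time units, matching the analysis above.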
- Intuition
- Pipeline ring is like a linear pipeline with its output channel tied back to its input channel
- Optimal performance
- No pipeline buffer starved
- No pipeline buffer stalled
- Dynamic slack/Occupancy (for full-buffers): (N * FL)/(FL + BL), as for the linear pipeline
- Bubble-limited loops
- Can improve performance by adding pipeline buffers (see the sketch after this list)
- Intuition
- Bubbles need to flow backwards less distance for tokens to flow forward
- Data-limited loops
- Increase multi-threading
- Shorten loop latency
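For the bubble-limited case, the effect of adding buffers falls out of the ring bound sketched above: with the token count k fixed, increasing N raises throughput toward the 1/BL asymptote (ignoring the 1/(FL + BL) ceiling):

```latex
\Theta_{\mathrm{ring}} \;=\; \frac{N - k}{N \cdot BL} \;=\; \frac{1 - k/N}{BL}
  \;\xrightarrow{\;N \to \infty\;}\; \frac{1}{BL}
```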
- Slowest fork-join path determines input-output latency
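Written out (a sketch in the same notation as before, taking the maximum over the branches b of the fork-join):

```latex
L_{\mathrm{fork\text{-}join}} \;=\; \max_{b}\; \sum_{i \in b} FL_i
```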
- Intuition
- Number of tokens in each branch of fork-join is identical
- Throughput versus # tokens
- Bounded above by the triangle graphs of the individual pipelines
- Performance characteristics
- Static slack is the minimum of the two individual pipelines
- Peak throughput can be lower than that of either individual pipeline
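Since both branches must hold the same number of tokens k, the combined graph is bounded by the pointwise minimum of the two individual canopy graphs (a sketch; Theta_A and Theta_B denote the branch throughput curves):

```latex
\Theta_{\mathrm{fork\text{-}join}}(k) \;\le\; \min\big( \Theta_A(k),\; \Theta_B(k) \big)
```

The peak of this minimum can lie below both individual peaks when those peaks occur at different token counts, which is the effect noted above.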
- Shorter pipeline (lower static slack) may not be the bottleneck
- Can happen if the peak throughput of the shorter pipeline is larger than that of the longer pipeline
- See Figure (a) above
- Equal static slack is not always optimal
- i.e., adding buffers may improve peak throughput despite causing a static slack imbalance
- See Figure (b) above
- Performance of asynchronous pipelines is complex
- Largely due to presence of backward latency of pipeline buffers
- Throughput versus # of tokens graph
- Effective way to analyze simple pipeline structures
- Provides good intuition of many issues
- More complex pipeline structures are common
- For example, forks / joins / conditionals in pipeline loops
- e.g., the GCD example
- Need more powerful methods of analyzing and optimizing pipelining
- Covered later on using performance models