Sunday, 15 January 2017

Project - Binary up counter

• Using the above template, extend the project to use a 30-bit counter ("29 downto 0"), displaying the top 8 bits on the LEDs. Remember to add a new constraint that forces the clk signal to be assigned the correct pin for your FPGA board’s "clock" signal.

# Constraints for Papilio One
NET "clk" LOC = "P89" | IOSTANDARD = LVCMOS25 ;
# Constraints for the Basys2
NET "Clk" LOC = "B8";
NET "Clk" CLOCK_DEDICATED_ROUTE = FALSE;

Project - Binary down counter

• Change the project to count down

 Project - Binary up/down counter

• Use one of the switches to indicate the direction to count.

Challenges

• Change the project to count accurately in (binary) seconds
• Change the project to time how long it takes to turn a switch on and off - you will need a second switch to reset the counter too!

Assessing the speed of a design

So now we have a design that knows the passing of time, but how quick can we clock it? How can we find what limits the performance of the design? If required, how can we make things faster?

The problem of timing closure

When working with FPGAs half the problem is getting a working design. The other half of the problem is getting the design to work fast enough! A lot of effort and trial and error can go into finding a solution that meets your design’s timing requirements, and sometimes the best solution is not the obvious one.

It may be possible to change your FPGA for a faster grade part or use vendor specific macros to improve the performance of a design, but often there are significant gains to be made in improving your design that do not incur additional cost or limit your design to one architecture.

Even if your original design easily meets your requirements, it is a good idea to look at the critical path and try to improve upon it. Not only is timing closure one of these problems where the more you practice the better you get, but usually a faster design will have quicker build times and fewer timing issues to resolve as the design tools will have more slack when implementing a design.

This chapter’s scenario

Imagine we are designing a project that needs to capture the timing and duration of a pulse to with better than 10ns accuracy, and these pulses occur within an interval of four seconds.

To give some design margin this calls for a 250MHz clock for the counter-- giving at most 4ns of uncertainty around the start and end of the pulse, and a worst case of 8ns of uncertainty around the width of each pulse. Due to the timings of up to 4 seconds (1,000,000,000 ticks) a 30-bit counter is required.

The goals of this design are simple-- make a 30-bit counter that runs at 250MHz that can be used to time-stamp events.

So how fast can a design run?

The answer is not what you most probably expect, rather than "You can clock this FPGA at X MHz" it is "as fast as the chosen FPGA and your design allows".

Let’s have a quick look at the two halves of this answer. . .

How the choice of FPGA changes speed

FPGAs come in different speed grades. A different speed grade indicates that the device’s performance is guaranteed to meet or exceed a set of modeling parameters, and it is these modeling parameters that allow the design tools to calculate the performance limits of your design.

Once the design has been compiled and mapped to the FPGA components, the tools calculate every path from input/output pins and flip-flops and total the delay every step of the way (much like finding the critical path in project management software). This is then used to generate a report that can be reviewed from the "Static Timing" section of the "Design Summary Report" window:

The most useful number is usually right down at the bottom:

Clock to Setup on destination clock clk
---------------+---------+---------+---------+---------+
| Src:Rise Src:Fall| Src:Rise| Src:Fall|Source Clock |Dest:Rise |Dest:Rise|Dest:Fall|Dest:Fall|
---------------+---------+---------+---------+---------+
clk | 4.053| | | |
---------------+---------+---------+---------+---------+
As this design has a minimum clock of 4.053 nanoseconds, it can be clocked at up to 246MHz and still be within the FPGA’s timing limits.
This is not fast enough in this scenario. Perhaps I could choose to use a faster grade FPGA. The gains for using a faster, more expensive chip are only minimal-- for example, going from a Spartan 3E -4 to -5 grade increases a sample design’s maximum speed by 13%.

How design decisions determine speed

Each flip-flop’s input and output act as a start or finish line for a race. When a clock signal ticks, the design’s flip-flops assumes a new value and the updated signals come out of the flip-flops and ripples out through connected logic cells until all logic signals are stable. At that point we are almost ready for all the flip-flops to update their internal signals to values. For a design to work correctly, the updated signal has to arrive at the flip-flop with at least enough time to ensure that when the clock ticks again the signal will be reliably captured.

So the three major components of this ’race’ are:

• Routing time - the time it takes to "charge the wires" that route signals between different logic cells. As you can well imagine, the drive strength of the source signal, the length of these wires and the number of gates connected to the wires ("fan-out") dictates how much current is required to accurately transfer signals across the FPGA, and therefore the routing time.
• Logic time - this is the time it takes for logic cells to react to a change of input and generate their new output values.
• Setup time - the time required to ensure that the destination of a signal will accurately capture a changed value on the next clock transition.

As a sweeping generalization, the more complex work that is carried out each clock cycle the greater the number of logic blocks in the critical path and the slower the design will run. As expected, the more you reduce the complexity of your design the quicker your design will run.

In some cases you can also use components from your FPGA vendor’s library of standard building blocks. These building blocks will usually have more efficient implementations as they will leverage architecture-specific features within logic blocks (such as fast carry chains). The cost for using these features is that your design becomes architecture dependent and will need to be re-engineered if you move to a different FPGA.

You would think that a basic component like a 30-bit counter would be hard to improve on, but even in this simple design gains of 20% can be achieved!

Can it be made to run faster without changing the design?

Like using an optimizing compiler, giving the EDA tools a hint of how fast you need the design to run may improve things. When the EDA tools map the design to the FPGA they can be asked to take timing into account - causing them to attempt different placements for components on the FPGA until a timing constraint is met. The "Static Timing Report" will detail any errors (errors are where the amount of "slack" time between a signal’s source and destination is less than zero - a negative "slack" means "not enough time").

The quick way to do this

The simple way to add a constraint is to include it in the Implementation constraints file by including these two lines:
NET "clk" TNM_NET = clk;
TIMESPEC TS_clk = PERIOD "clk" 4 ns HIGH 50%;

The time of 4 nanoseconds ("4 ns") gives a constraint of 250MHz to aim for.

Note

Usually you will use the actual clock period of the design (31.25ns for the Papilio One or 20ns for the Basys2 running at 50MHz)

The long way to do this

Timing constraints can also be entered using the GUI tools. You first have to successfully compile your project, so the tools can deduce what clocks are present. Then, in the process window, open the "Timing Constraints" tool:


You will then be presented with the list of all unconstrained clocks:

Double-click on the unconstrained clock, and you will be presented with this dialogue box:


Fill it in appropriately then close all the timing constraint related windows (forcing the constraint to be saved).

You will now need to rebuild the project with this new constraint in place.

What happens if the tools are unable to meet the constraint?

If the design is unable to meet the timing requirements, details of the failing paths will be reported in the Static Timing Report:

...
Timing constraint: TS_clk = PERIOD TIMEGRP "clk" 4 ns HIGH 50%;
465 paths analyzed, 73 endpoints analyzed, 1 failing endpoint
1 timing error detected. (1 setup error, 0 hold errors, 0 component switching limit errors)
Minimum period is 4.053ns.
--------------------------------------------------------------------------------
Paths for end point counter_29 (SLICE_X53Y78.CIN), 28 paths
--------------------------------------------------------------------------------
Slack (setup path): -0.053ns (requirement - (data path - clock path skew + uncertainty))
Source: counter_0 (FF)
Destination: counter_29 (FF)
Requirement: 4.000ns
Data Path Delay: 4.053ns (Levels of Logic = 15)
Clock Path Skew: 0.000ns
Source Clock: clk_BUFGP rising at 0.000ns
Destination Clock: clk_BUFGP rising at 4.000ns
Clock Uncertainty: 0.000ns
Maximum Data Path: counter_0 to counter_29
Location Delay type Delay(ns) Physical Resource
Logical Resource(s)
------------------------------------------------- -------------------
SLICE_X53Y64.XQ Tcko 0.514 counter<0>
counter_0
SLICE_X53Y64.F4 net (fanout=1) 0.317 counter<0>
SLICE_X53Y64.COUT Topcyf 1.011 counter<0>
Mcount_counter_lut<0>_INV_0
Mcount_counter_cy<0>
Mcount_counter_cy<1>
SLICE_X53Y65.CIN net (fanout=1) 0.000 Mcount_counter_cy<1>
SLICE_X53Y65.COUT Tbyp 0.103 counter<2>
Mcount_counter_cy<2>
Mcount_counter_cy<3>
SLICE_X53Y66.CIN net (fanout=1) 0.000 Mcount_counter_cy<3>
SLICE_X53Y66.COUT Tbyp 0.103 counter<4>
Mcount_counter_cy<4>
Mcount_counter_cy<5>
SLICE_X53Y67.CIN net (fanout=1) 0.000 Mcount_counter_cy<5>
SLICE_X53Y67.COUT Tbyp 0.103 counter<6>
Mcount_counter_cy<6>
Mcount_counter_cy<7>
SLICE_X53Y68.CIN net (fanout=1) 0.000 Mcount_counter_cy<7>
SLICE_X53Y68.COUT Tbyp 0.103 counter<8>
Mcount_counter_cy<8>
Mcount_counter_cy<9>
SLICE_X53Y69.CIN net (fanout=1) 0.000 Mcount_counter_cy<9>
SLICE_X53Y69.COUT Tbyp 0.103 counter<10>
Mcount_counter_cy<10>
Mcount_counter_cy<11>
SLICE_X53Y70.CIN net (fanout=1) 0.000 Mcount_counter_cy<11>
SLICE_X53Y70.COUT Tbyp 0.103 counter<12>
Mcount_counter_cy<12>
Mcount_counter_cy<13>
SLICE_X53Y71.CIN net (fanout=1) 0.000 Mcount_counter_cy<13>
SLICE_X53Y71.COUT Tbyp 0.103 counter<14>
Mcount_counter_cy<14>
Mcount_counter_cy<15>
SLICE_X53Y72.CIN net (fanout=1) 0.000 Mcount_counter_cy<15>
SLICE_X53Y72.COUT Tbyp 0.103 counter<16>
Mcount_counter_cy<16>
Mcount_counter_cy<17>
SLICE_X53Y73.CIN net (fanout=1) 0.000 Mcount_counter_cy<17>
SLICE_X53Y73.COUT Tbyp 0.103 counter<18>
Mcount_counter_cy<18>
Mcount_counter_cy<19>
SLICE_X53Y74.CIN net (fanout=1) 0.000 Mcount_counter_cy<19>
SLICE_X53Y74.COUT Tbyp 0.103 counter<20>
Mcount_counter_cy<20>
Mcount_counter_cy<21>
SLICE_X53Y75.CIN net (fanout=1) 0.000 Mcount_counter_cy<21>
SLICE_X53Y75.COUT Tbyp 0.103 counter<22>
Mcount_counter_cy<22>
Mcount_counter_cy<23>
SLICE_X53Y76.CIN net (fanout=1) 0.000 Mcount_counter_cy<23>
SLICE_X53Y76.COUT Tbyp 0.103 counter<24>
Mcount_counter_cy<24>
Mcount_counter_cy<25>
SLICE_X53Y77.CIN net (fanout=1) 0.000 Mcount_counter_cy<25>
SLICE_X53Y77.COUT Tbyp 0.103 counter<26>
Mcount_counter_cy<26>
Mcount_counter_cy<27>
SLICE_X53Y78.CIN net (fanout=1) 0.000 Mcount_counter_cy<27>
SLICE_X53Y78.CLK Tcinck 0.872 counter<28>
Mcount_counter_cy<28>
Mcount_counter_xor<29>
counter_29
------------------------------------------------- ---------------------------
Total 4.053ns (3.736ns logic, 0.317ns route)
(92.2% logic, 7.8% route)
...

Because this is such a simple design only minimal automatic improvement was achieved, but it allows you to see where the crunch is-- it is the time taken for the signal to propagate from "counter(0)" through to "counter(29)".

Also worthy of note is that each step along the way takes 0.103ns, indicating that there is a fundamental relationship between the size of numbers you manipulate and a design’s speed-- a 32-bit counter would take 0.206ns longer to compute the result.

Interestingly enough, given that 3.736ns of the time are incurred by logic and only 0.317ns is incurred by routing, it would be easy to assume that a 30 bit counter cannot be implemented to run faster than 267MHz in this FPGA-- any faster and there would not be enough time for the logic to do its magic. You would be wrong-- the final design in this chapter runs at 298MHz!

So how can something as simple as a counter be improved?

Here’s the existing design-- it is a small and easy to understand design. It also allows easy verification by checking that the LEDs count at the same speed whenever changes are made:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
entity Switches_LEDs is
Port ( switches : in STD_LOGIC_VECTOR(7 downto 0);
LEDs : out STD_LOGIC_VECTOR(7 downto 0);
clk : in STD_LOGIC
);
end Switches_LEDs;
architecture Behavioral of Switches_LEDs is
signal counter : STD_LOGIC_VECTOR(29 downto 0) := (others => ’0’);
begin
LEDs <= counter(29 downto 22);
clk_proc: process(clk, counter)
begin
if rising_edge(clk) then
counter <= counter+1;
end if;
end process;
end Behavioral;

As mentioned earlier, the problem is the length of "counter"-- at 30 bits long it will take at least 30*0.103ns = 3.09ns to increment.

How about splitting the counter into two 15-bit counters? Will it be any faster? Let us see-- Here’s the updated design:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
entity Switches_LEDs is
Port ( switches : in STD_LOGIC_VECTOR(7 downto 0);
LEDs : out STD_LOGIC_VECTOR(7 downto 0);
clk : in STD_LOGIC
);
end Switches_LEDs;
architecture Behavioral of Switches_LEDs is
signal counter : STD_LOGIC_VECTOR(29 downto 0) := (others => ’0’);
signal incHighNext : STD_LOGIC := ’0’;
begin
LEDs <= counter(29 downto 22);
clk_proc: process(clk, counter)
begin
if rising_edge(clk) then
counter(29 downto 15) <= counter(29 downto 15)+incHighNext;
if counter(14 downto 0) = "111111111111110" then
incHighNext <= ’1’;
else
incHighNext <= ’0’;
end if;
counter(14 downto 0) <= counter(14 downto 0)+1;
end if;
end process;
end Behavioral;

Here’s the updated timing report:

All values displayed in nanoseconds (ns)

Clock to Setup on destination clock clk

---------------+---------+---------+---------+---------+
| Src:Rise| Src:Fall| Src:Rise| Src:Fall|
Source Clock |Dest:Rise|Dest:Rise|Dest:Fall|Dest:Fall|
---------------+---------+---------+---------+---------+
clk | 3.537| | | |
---------------+---------+---------+---------+---------+
Timing summary:
---------------
Timing errors: 0 Score: 0 (Setup/Max: 0, Hold: 0)
Constraints cover 284 paths, 0 nets, and 82 connections
Design statistics:
Minimum period: 3.537ns{1} (Maximum frequency: 282.725MHz)

By using a little more logic the design has gone from 246MHz to 282MHz - about 14% faster.

Why does this work? By exploiting what we know in advance (which is when there will be a ’carry’ from bit 14 to bit 15), and storing that in a handy flip-flop (incHighNext) we have split the critical path across two clock cycles.

Project - More speed!

• See what the maximum speed of the design using the 30-bit counter is for your FPGA board
• Try changing the speed grade of the FPGA and see what difference that makes to the timing. (To do this, in the hierarchy window right-click on the chip (e.g. xc3c100e-4cp132) and choose ’properties’-- just remember to set it back later on!)
• Add a timing constraint and try again
• See what the maximum safe clock speed is when using the 15+15 split counter design on your FPGA board

Challenges

• Is the 15+15 split counter optimal compared against a 14+16 split or 16+14 split? If not, why not?
• Can you increase the maximum clock speed for the project even further?
• What is the largest counter you can make that runs at 100MHz?

An even better design?

The code below is definately not a better design. It is an illustration that by trying different structures it is sometimes possible to eek out the last bit of performance. This is much like using assember in software development.

Although logically equivilent the code below performs faster still. How this works is a very complex discussion and is EDA tools specific, but it shows how it is sometimes possible to improve on the generated code.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
entity Switches_LEDs is
Port ( switches : in STD_LOGIC_VECTOR(7 downto 0);
LEDs : out STD_LOGIC_VECTOR(7 downto 0);
clk : in STD_LOGIC
);
end Switches_LEDs;
architecture Behavioral of Switches_LEDs is
signal counter : STD_LOGIC_VECTOR(29 downto 0) := (others => ’0’);
signal incHighNext : STD_LOGIC := ’0’;
begin
LEDs <= counter(29 downto 22);
clk_proc: process(clk, counter)
begin
if rising_edge(clk) then
counter(29 downto 15) <= counter(29 downto 15)+incHighNext;
incHighNext <= not counter(0) and counter(1) and counter(2) and counter(3)
and counter(4) and counter(5) and counter(6) and counter(7)
and counter(8) and counter(9) and counter(10) and counter(11)
and counter(12) and counter(13) and counter (14);
counter(14 downto 0) <= counter(14 downto 0)+1;
end if;
end process;
end Behavioral;

It runs at 298MHz on my FPGA!

Random thoughts on timing

• If there is a chance that performance will be an issue, consider setting a metric for the "levels of logic" you have in your design, and regularly review your design’s static timing during the project. This is a very simple way to ensure that you will not end up with one or two very long chains of logic that limit design performance requiring significant re-work to resolve
• Vendors will tell you that "floorplanning" (the high-level planning of the placement of logic in a design) will allow projects to make significant improvements in achieving timing closure, but only routing delays can be reduced by controlling the placement. This is much like saying that an optimizing compiler can fix code performance issues
• The best way to solve tricky timing closure issues is to avoid the timing issues in the first place with clean, simple designs
• It is usually possible to identify the critical ares within a project in advance, allowing the feasibility of a design to be assessed early on in a project
• Design all projects with speed in mind, even if the design requirements do not dictate it. A design that can run very fast is also very efficient on power when running at lower clock speeds, and it is great practice
• "Mapping" options can have a big difference on the final design performance. In general, settings that reduce the size of a design make the design slower (due to high fan-outs and merging of flip-flops)
• The same design will usually run faster on a larger FPGA due to greater freedom in the place and route process. Likewise, downsizing an existing design into an FPGA that is ’just big enough’ will lower performance.


No comments:

Post a Comment