Wednesday 25 January 2017

Binary Multiplication

Up to now we have managed to complete all the projects using only logical operations and addition or subtraction. But then there comes a time when you need multiplication - and this is where FPGAs really shine.
After going through the basics of binary multiplication you’ll be introduced to the embedded multiplier blocks in the Spartan 3E. The XC3S250E FPGAs have twelve of these blocks, allowing you to do number crunching of well over two billion 18-bit multiplications per second, allowing it to compete with a desktop CPU core.

Performance of binary multiplication

Binary multiplication is complex - implementing multiplication of any two n-bit numbers requires approximately n*(n-1) full adders, n half adders and n*n AND operations.

To multiply the four-bit binary numbers "abcd" by "efgh" the following operations are needed (where & is a binary AND):

+                                           a&h    b&h      c&h     d&h
+                               a&g     b&g    c&g      d&g      0
+                   a&f      b&f     c&f     d&f       0           0
+         a&e   b&e      c&e     d&e      0          0           0
     ---   ---      ---         ---       ---       ---        ---          ---
=    ?     ?        ?            ?        ?          ?           ?           ?

Multiplication also has a big implication for your design’s performance - because of the carries required, multiplying two n-bit numbers takes around twice as long as adding two n-bit numbers.

It also consumes a very large amount of logic if multiplication is implemented in the generic logic blocks within the FPGA.

Multiplication in FPGAs

To get around this, most FPGAs include multiple multiplier blocks - an XC3S100 has four 18 bit x 18 bit multipliers, and a XC3S250 has twelve!

To improve performance, multipliers also include additional registers, allowing the multiplicands and result to be registeredwithin the multiplier block. There are also optional registers within the multiplier that hold the partial resulthalf way through the multiplication.

Using these internal registers greatly improves throughput performance by removing the routing delays to get the inputs to and from the multipliers, but at the cost of increased latency - measured in either time between two numbers being given to the multiplier and the result being available, or the number of clock cycles.

When all these internal registers are enabled the multiplier works as follows:

Clock                          cycle Action
 0                                A and B inputs are latched
 1                                The first half of the multiplication is performed
 2                                The second half of the multiplication is performed
 3                                The result of the multiplication is available on the P output

Multipliers can accept a new set of A and B values each clock cycle, so up to three can be in flight at any one time. In some cases this is useful but in other cases it can be annoying.

A useful case might be processing Red/Green/Blue video values, where each channel is separate.

An annoying case is where feedback is needed of the output value back into the input. If the math isn’t in your favor you may be better off not using any registers at all - it may even be slightly faster and running at one-third the clock speed will use less power.

What if 18x18 isn’t wide enough?

What if you want to use bigger numbers? Say 32 bits? Sure!

Just like in decimal when multiplying pairs of two-digit numbers "ab" and "cd" is calculated as "a*c*10*10 + b*c*10 + a*d*10 + b*d" the same works - just replace each of a,v,c,d with an 18 bit number, and each 10 with 2ˆ18.

As the designer you have the choice of either:

• using four multipliers and three adders, with a best-performance latency of 5 cycles, and with a throughput of one pair of A and B values per clock
• using the same multiplier to calculate each of the four intermediate products, with a best-performance latency of 13 cycles (four 3-cycle multiplications plus the final addition) and with careful scheduling you can process three input pairs every 12 cycles

Project - Digital volume control

• Revisit the Audio output project
• Use the CORE Generator to add an IP multiplier, with an 8 bit unsigned input for the volume and the other matching your BRAM’s sample size
• Add a multiplier between the block BRAM and the DAC, using the value of switches as the other input
• Use the highest output bits of the multiplier to feed the DAC
• If you get the correct signed / unsigned settings for each input of the multiplier you will now be able to control the volume with the switches

Unless you are careful you may have issues with mismatching signed and unsigned values - it pays to simulate this carefully!

• You can also implement multiplication using the generic logic ("LUT"s). If interested, you can change the IP multiplier to use LUTs instead of the dedicated multiplier blocks and compare maximum performance and resource usage.

No comments:

Post a Comment