2020-02-18
Changed service provider and email address after 20+ years. New email
address is found in the footer of this page.
2016-11-15
Re-released 1.0o with corrected version number in the output (it
incorrectly said 1.0m before).
2016-08-09
Maintenance release 1.0o, the second release today; I was a bit trigger happy on
the first. This one includes some minor bugfixes received from the Debian
package maintainer.
2016-08-09
Maintenance release 1.0n, no functional change.
2013-11-29
There was still a typo in the last uploaded 1.0m affecting
SSE2. Uploaded fix.
2013-11-28
There was a typo in the last uploaded 1.0m release causing the
S24_LE/S24_3LE formats to break for input. So if you downloaded 1.0m
yesterday please do it again.
2013-11-27
BruteFIR v1.0m. Fixed an SSE2 bug introduced in 1.0l. Added
'safety_limit' feature which can be used to protect your expensive
speakers (and sensitive ears). Also fixed a rare race condition bug
and further synchronized sample formats with ALSA, so now S24_4LE
means low 24 bits of 32 bit word. Thus if you used S24_4LE before you
should use S32_LE now to get the old behavior.
2013-10-06
BruteFIR v1.0l. Refreshed code to compile well on x86-64, dropped
3Dnow support and replaced the hand-coded SSE with SSE C code, and
refreshed JACK and ALSA I/O modules to catch up with changes in the
APIs. Also fixed a filter indexing bug in the cffa CLI command.
2009-03-31
BruteFIR v1.0k. Refreshed JACK and ALSA I/O modules to catch up with
changes in the APIs.
2009-03-05
BruteFIR v1.0j. Fixed a memory leak in the CLI.
As you may have noted, I no longer actively develop BruteFIR. From my point of view the software is "complete". I have had plans to start a next generation BruteFIR from scratch with a more modern design (using threads instead of forked processes etc), but priorities in life change, and I no longer have much time to write code, so it is not likely to happen.
However, I'm happy to see that there are many BruteFIR users out there. Do continue to report bugs as my intention is to keep the code working and fix any bugs that arise.
2006-10-08
BruteFIR v1.0i. Minor fixes in CLI. Sub-sample delay now works also
with negative delays.
There's also some interesting patent news, actually this is old news, but I did not know about it until now - the patent EP0649578 has been revoked after an opposition. It was argued that from the existing prior art (of which most is referenced here), the non-uniform partitioned part of the patent lacks inventive step. Additionally, there is a testimony that claims that the "invention" was actually exposed in advance - an academic person at a Danish university explained the idea to the people that then went home and filed the patent which has plagued the industry and open-source world for so long. A theft and lockdown of ideas which we so often before have seen in the world of patents. Anyway, based on this opposition, the EPO has revoked the patent.
For BruteFIR this does not mean anything since it employs uniform partitioned convolution. However, the non-uniform partitioned convolution algorithm is probably free to use in open-source software. There are still patents on this in other countries (such as the US), but with the corresponding patent revoked in Europe, they will be hard to defend. Note that I'm not a patent lawyer, so if you really are going to implement non-uniform partitioned convolution I recommend consulting a professional first, because there is no 100% clear prior art as in the uniform partitioned convolution case.
In the future, there might be a non-uniform version of BruteFIR, but not likely in the near future since it will require a new design from the ground up. The convolution principle is the same, but the implementation is much different with non-uniform partitions, since you have to perform different size FFTs in parallel. The simplest idea would be to run the same convolution engine in several "layers" with different partition sizes and mix together the results at the end. This way the implementation difference is small and could be realized quite easily in BruteFIR, but efficiency will suffer. I would probably rather spend the time doing it from the ground up. But not this year.
2006-07-12
BruteFIR v1.0h. Added a sub-sample delay function (that is delays
smaller than one sample can be specified), support for text format in
the file I/O module, and support for naming ports in the JACK I/O
module.
2006-03-30
BruteFIR v1.0g. Fixed input mixer and delay setting bugs.
2005-08-11
BruteFIR v1.0f. Fixed a filter parse bug.
2005-06-28
BruteFIR v1.0e. Fix to work with GCC 4.0.
2005-06-12
Released a minor maintenance release, 1.0d. Contains some minor
adjustments to the JACK I/O module, and a fatal bug fix concerning
multiple inputs/outputs, which was introduced in 1.0b.
2005-01-04
BruteFIR v1.0c. Mistake in 1.0b caused the CLI module not accepting
return characters, causing problems for telnet operation. Fixed that.
2004-11-21
BruteFIR v1.0b. Updated the JACK I/O module, it is now possible to run
several BruteFIR instances using JACK at the same time, and it is not
necessary to connect to external ports at startup. The CLI can now
take commands from a serial line. Additionally, a couple of remaining
bugs in the equalizer module have been fixed.
2004-08-07
BruteFIR v1.0a. Minor update, removed the coefficient set limit and
updated the example configuration files in the package.
2004-04-21
BruteFIR v1.0. I felt it was time to release 1.0 now. I have fixed up
the code so it compiles on FreeBSD and Solaris again, and I added an
OSS module, which makes BruteFIR truly usable on FreeBSD platforms.
2004-02-22
BruteFIR v0.99n. As suspected, the merge function was not good enough,
and has now been removed. However, instead, a cross-fade algorithm has
been added, which indeed is a bit costly in terms of CPU time, but
makes coefficient changes truly seamless.
2004-01-17
BruteFIR v0.99m. Fixed a few bugs, and updated the ALSA code to
support the 1.0 version of the API. Now it is also possible to make more
time-precise CLI scripting. I'm suspecting that the merge function is
not good enough to be very useful, I may remove it or replace it with
a better sounding (but inefficient) cross-fade algorithm.
2003-10-26
BruteFIR v0.99l. Added a function to hide discontinuities that may
occur when filter coefficients are changed in runtime (function is
called "merge"). Also added a skip option to coefficient loading, to
skip a given number of bytes in the beginning of a file.
2003-08-10
BruteFIR v0.99k. This is a maintenance release, which fixes a few
bugs, including a severe powersave bug which could cause unexpected
and very loud noise to come out. It also adds an option to run in daemon mode.
2003-07-11
BruteFIR v0.99j. Now we are getting near a 1.0 release. This release
contains quite many new features, and bug fixes. Some feature
highlights: BruteFIR now employs FFTW3, there is support for 32 and 64
bits in the same binary and buffer over/underflows can be
ignored. Among important bug fixes are that FFTW wisdom is now stored
properly, so it can be re-used more often, and the equalizer module
now sets the magnitude properly at the edges.
2003-02-11
BruteFIR v0.99i. I released the h-version a bit too early, lots of
small but significant mistakes followed. This version fixes those
(hopefully).
2003-02-09
BruteFIR v0.99h. A couple of bug fixes associated to the new callback
I/O. It also adds support for native endian and auto sample formats,
and a simple automatic load balancer for multi-processor machines.
2003-02-02
BruteFIR v0.99g. This release adds support for callback I/O. One
callback I/O module is available, supporting JACK. This support means
that the program has gone through quite radical reorganizations, so
something might be broken. If you discover any problems, please let me know.
2003-01-05
BruteFIR v0.99f. Minor peak meter adjustment and bug fix.
2002-12-25
BruteFIR v0.99e. Lots of tuning has been done to work better with
sound card I/O. It should now be more reliable in low latency
configurations. The release also includes some various minor
improvements and bug fixes.
For those that find the default configuration file unnecessary and
just in the way, there is now the -nodefault command line
option, which will cause BruteFIR to skip the default configuration
file.
2002-11-28
BruteFIR v0.99d. Fixes yet another bug in the ALSA code, which caused
the software not to work with hardware with odd period sizes, such as
some (all?) ice1712-based cards. The real-time index has also been
much simplified and improved in terms of reliability, and a power-save
feature was added.
Sometime soon, there will be 1.0...
2002-10-10
BruteFIR v0.99c, is an important bug fix release. Among other fixes,
it fixes the slightly embarrassing bug of incorrect reading of
3 byte 24 bit formats. Apart from many bug fixes, it adds double
buffer support to the equalizer module, and a simple script function
to the CLI. The risk of buffer underflow at startup has also been
strongly reduced.
2002-09-12
BruteFIR v0.99b, fixed a serious bug in the ALSA code, which caused
buffer underflow when the software buffer size was larger than the
hardware buffer size.
2002-08-25
BruteFIR v0.99a, a couple of minor bug fixes, discovered during the
development of AlmusVCU.
2002-08-04
This new release (v0.99) contains a first version of an equalizer
module, which allows equalization to be changed in runtime. Now the
I/O delay is fixed, always exactly twice the filter block length
(if the sound card hardware is properly designed). Good for
synchronization with other audio processors, or clustering. There is
also a slight change in configuration file format, so you know why it
will complain when run with an old configuration file.
2002-07-26
Added a minor feature that proved necessary for some applications,
such as Ambisonics. This feature makes it possible to multiply
inputs/outputs in mixing with negative values, not just positive. The
new version is BruteFIR 0.98e. An invalid version was available a few
hours during this day (forgot to include some CLI patches), so if you
downloaded your v0.98e at this date, download it again.
2002-07-21
Two bug fixes in this new release, BruteFIR 0.98d. The first concerns
scaling of coefficient parameters, where PCM coefficients were
incorrectly scaled. The other fix is in the ALSA I/O module, which
could on some occasions fail to set the sample rate.
2002-06-14
BruteFIR 0.98c, another small step towards 1.0. This contains an
important bugfix. Earlier versions could mix up the mix buffers which
caused looping sound with some filter configurations, this is now
fixed. The common mistake (at least for me) to link a 32 bit BruteFIR
with a 64 bit FFTW or the other way around is now taken care of.
2002-05-05
BruteFIR 0.98b. The sample rate monitoring added in 0.98a is now
optional, through the option monitor_rate. Also support for SSE2 for
Pentium 4 processors is implemented (only used when compiled with
double precision). It is also possible to compile and run on Solaris
with Sparc processors.
2002-04-16
Yet another of the usual minor updates: BruteFIR 0.98a. This fixes a
minor bug which could cause stray processes to be left after exit. It
also improves the real-time index calculation so it works properly on
SMP, and the program now exits with an error when sample rate is
changed in runtime. There are now interpretable exit codes from the
program as well, so one can know why it exited.
2002-03-25
BruteFIR 0.98: This new release supports virtual inputs and outputs,
which can be used to control delay of individual outputs even if they
are mixed to the same physical output.
2001-12-20
Another bugfix release, 0.97d. Also added a -quiet command
line parameter to suppress title, warnings and informational messages
at startup.
2001-12-17
Due to popular demand, the ALSA I/O module has got support for
accessing the software modes of the ALSA library. The new release is
0.97c.
2001-12-16
Ooops. The new sample format handling was not as good as I initially
thought. Now that has been fixed. Oh, clipping for 32 bit formats
works again. I hope I did not burst anyone's ears (other than
mine). The release version is 0.97b.
2001-12-15
Some major bugs were introduced in 0.97; hopefully most of them have
been squashed in this new release, 0.97a.
2001-12-09
BruteFIR 0.97: a new release with lots of major changes. The software is
now much more modular. It uses modules for input and output, ALSA and
file I/O being the first modules available. It also supports
logic modules, the old BruteFIR CLI being the first example. The logic
modules can be used to achieve adaptive filtering. The new module
architecture will probably need some time to stabilize, and due to the
large amount of changes to the code, there is a great risk that this
new version is less stable than the last. A few details in the
configuration file format have changed as well, for which the
documentation has been updated. The documentation for how to program a
BruteFIR module is not yet available though.
2001-11-04
Added a todo list. Any suggestions are welcome of course.
2001-10-27
Added some quick and dirty benchmarks, and added some new
documentation. I made a low latency benchmark due to popular demand,
and the interesting result is that it is possible to get as low as
three milliseconds I/O delay, which is much lower than what I expected.
2001-09-27
New release, BruteFIR 0.96a. Some minor bugfixes, and at last
processor capability detection code has been included, so BruteFIR
will detect SSE or 3DNow, and use the optimized code accordingly.
2001-08-26
Updated documentation to cover all the new features of BruteFIR 0.96.
2001-08-20
BruteFIR 0.96 has been released, with a few important bugfixes, but
also many new features, which have not yet been documented here. It is
now possible to make filter networks, and have different lengths
on different filters.
2001-07-18
A new release, BruteFIR 0.95b, which contains an important bugfix is
available for download. It fixes a block bounds violation error when
converting from 32 bit integers to floating point. It also contains
some tuning of realtime priorities.
2001-06-10
Some minor updates to the documentation.
2001-06-03
A bugfix release, BruteFIR 0.95a, is available for download. It fixes
a bug which caused the program to crash when long filters in raw
format were read.
The documentation is now up to date again.
2001-05-26
New release, BruteFIR 0.95. This includes some new features, for
example support for changing delay in runtime and support for
non-interleaved sound cards. An important bug fix has also been
applied, when mixing files and sound cards for inputs/outputs trouble
could occur, but that should be fixed now.
Again, the documentation on this page is not entirely up to date with the software itself.
2001-04-11
BruteFIR 0.94a released, which is a bugfix release. A severe bug in
the ALSA support code caused the error "Hardware does not support
enough fragments." with common sound cards. Now it is gone. Still
there is some work to do on the ALSA support code, like adding support
for cards with non-interleaved buffer layout (like the RME9652).
2001-04-08
Major changes and cleanups of this page have been made, and the source
code has been re-released. The new version is 0.94, and contains a new
improved convolution algorithm with hand-coded assembler optimizations
for Intel's SSE and AMD's 3Dnow. With this, BruteFIR is now capable of
even higher throughput.
Note that the core of this documentation was written between 1999 and 2001 and is thus old. It's up to date regarding how to configure BruteFIR, but there are many references to old kernel versions and old CPUs embedded in here.
BruteFIR is a software convolution engine, a program for applying long FIR filters to multi-channel digital audio, either offline or in realtime. Its basic operation is specified through a configuration file, and filters, attenuation and delay can be changed in runtime through a simple command line interface. The FIR filter algorithm used is an optimized frequency domain algorithm, partly implemented in hand-coded assembler, thus throughput is extremely high. In realtime, a standard computer can typically run more than 10 channels with more than 60000 filter taps each.
Through its highly modular design, things like adaptive filtering, signal generators and sample I/O are easily added, extended and modified, without the need to alter the program itself.
BruteFIR is free and open-source. It is licensed through the GNU General Public License [6].
The preferred operating system platform for the program is Linux [11], but it is easily ported to other Unixes as well, and supports for example FreeBSD out of the box. BruteFIR uses the high-performance FFTW library [7] for the Fast Fourier Transform (FFT, [5]) calculations, and ALSA, the Advanced Linux Sound Architecture [2], is the preferred way of interfacing sound cards, although OSS, Open Sound System [25], is supported as well. The main features are:
A few examples of applications where BruteFIR could be a central component:
If you are interested in room equalization, my old NWFIIR project [18] might be of interest. It's a bit dated though. A better program for room equalization is Denis Sbragion's DRC [22].
The main design goal of BruteFIR is to achieve as high throughput as possible when filters are long (longer than 10000 taps). This means that the filter algorithm must be very fast, since it will be consuming almost all processor time of the whole program. BruteFIR's convolution algorithm is an example of a situation where a theoretically less efficient algorithm is faster in practice, because it is easily optimized and hides performance problems of more complex components.
Frequency domain algorithms for convolution are much faster than the straightforward time domain one when filters are long. The well-known overlap-save algorithm is used as the base in BruteFIR's convolution. However, there are practical problems with this algorithm, as we will see.
Efficient convolution is done in the frequency domain and therefore an FFT algorithm is needed. The FFT calculations typically occupy more than 90% of all processing time when plain overlap-save is employed. Unfortunately, the FFT is not easy to implement. Numerous implementations exist, and they vary greatly in performance, which is one sign of the complexity. Since it takes up almost all processing time, we must optimize it in order to make the convolution faster. This leaves us with a quite hard optimization problem.
One way to optimize is to code assembler by hand and try to be better than the compiler. Modern processors for personal computers like Intel's Pentium III [10] or AMD's Athlon [1] have custom SIMD instructions (Single Instruction Multiple Data), which allow a single instruction to operate on more than one data element at a time. For example, a single instruction may add together four or eight floating point numbers. Typically, one can improve the performance of an algorithm four times when using these instructions. They are not used by common compilers like GCC (GNU Compiler Collection [9]), meaning that we have a good opportunity to write assembler code that will outperform compiler-generated code by a wide margin. Most FFT libraries are written in C, and thus do not use these efficient SIMD instructions. So, theoretically, we could implement an FFT algorithm using SIMD instructions and beat the ones already available. However, we are going for a simpler approach, as we shall see. Since one of the design goals of BruteFIR is to be fairly portable, we want to make any assembler implementation small and simple, so it can easily be ported to other processor architectures. 'Small' might apply to an assembler implementation of FFT, but 'simple' certainly would not. In conclusion, we find optimization with assembler an attractive method to increase the performance of existing algorithms. However, the algorithm we need to optimize, FFT, is quite complex and thus not an attractive target for optimization.
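To make the idea of one instruction operating on several data elements concrete, here is a minimal sketch, not taken from BruteFIR, that adds four single-precision floats with a single SSE addition through the compiler intrinsics in xmmintrin.h. The helper name add4 and the scalar fallback for non-SSE targets are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Adds four floats from a and b element-wise and stores the sums in out.
 * add4 is a hypothetical helper, not part of BruteFIR. */
static void add4(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
#if HAVE_SSE
    __m128 va = _mm_loadu_ps(a);             /* load 4 unaligned floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* one addition covers all 4 lanes */
#else
    size_t i;
    for (i = 0; i < 4; i++)                  /* portable scalar fallback */
        out[i] = a[i] + b[i];
#endif
}
```

The unaligned load and store variants are used so the sketch works with any float array; aligned variants exist as well but require 16-byte aligned buffers.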
One of the fastest FFT libraries available is FFTW [7], [8], which is used by BruteFIR. There are more efficient FFT libraries out there (?), but they are often limited to short lengths (typically less than 8192), or are not free software nor open-source, which is a requirement of the BruteFIR project.
The performance problems with long FFTs are of course due to memory accesses, and poor cooperation between the hardware caching architecture and the software. When the data of the algorithm exceeds the cache size, the problem becomes obvious.
Both the Pentium and Athlon architectures allow the software to give the cache hints to reduce problems in these situations, but this must be done in assembler, and is therefore seldom used.
Apart from performance problems, long FFTs include more multiplications and scalings, which induce a larger quantization error. This is however a minor problem (?).
We have seen that the central algorithm of fast convolution, the Fast Fourier Transform, is complex to implement and optimize. We have also seen that the need for long FFTs reduces the choice of available implementations and that the existing ones can behave poorly on some hardware architectures. A modified fast convolution algorithm that uses shorter FFTs, and where most time is spent in code which is small and easily optimized, would be ideal.
Many have worked on improving the standard frequency domain convolution algorithms for different purposes. The central idea found in many of these improvements, is that the impulse response, that is the filter, is partitioned into several smaller parts. When each part is filtered with the input, the results delayed suitably and finally added together, one gets the same result as when processing the whole filter at once. As far as I know, the earliest user of this simple but powerful concept is T.G. Stockham [16], who published his results only one year after the famous Cooley and Tukey FFT paper [5]. The concept can be used to solve several problems. Stockham used it for saving memory, but in later work made in the eighties and early nineties, at the time when realtime DSP became feasible for the first time, it was stated that it can also be used to reduce quantization errors, reduce I/O-delay, and adapt to optimal FFT lengths of a specific implementation. All these improvements are described by J.S. Soo and K.K. Pang [14], [15]. Other realtime partitioned convolution pioneers are B.D. Kulp [17], P.C.W. Sommen [12], [13] and J.M.P. Borrallo and M. G. Otero [4]. Their work is a good place to start reading for the one interested in getting a more detailed description of partitioned convolution. The convolution algorithm in BruteFIR is conceptually exactly the same as the one found in these papers.
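The partitioning idea can be checked with a small sketch. It works in the time domain with direct convolution, purely for illustration; BruteFIR itself works in the frequency domain, and the function names here are hypothetical. Each part of the filter is convolved with the input on its own, and its output is added in at a delay equal to that part's offset within the filter.

```c
#include <assert.h>
#include <stddef.h>

/* Direct time-domain convolution of x with h, accumulated into y.
 * y must have room for x_len + h_len - 1 samples. */
static void convolve_add(const double *x, size_t x_len,
                         const double *h, size_t h_len, double *y)
{
    size_t i, j;
    for (i = 0; i < x_len; i++)
        for (j = 0; j < h_len; j++)
            y[i + j] += x[i] * h[j];
}

/* Same result, but the filter is split into parts of at most part_len
 * taps. Writing to y + off delays each part's output by its offset. */
static void convolve_partitioned(const double *x, size_t x_len,
                                 const double *h, size_t h_len,
                                 size_t part_len, double *y)
{
    size_t off;
    assert(part_len > 0);
    for (off = 0; off < h_len; off += part_len) {
        size_t len = (h_len - off < part_len) ? h_len - off : part_len;
        convolve_add(x, x_len, h + off, len, y + off);
    }
}
```

With a zero-initialized output buffer of x_len + h_len - 1 samples, both functions should produce identical results for any partition length, including one that does not divide the filter length evenly.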
When partitioned convolution is used, something interesting happens in the processing time distribution of the algorithm. The major part of processing is moved from the FFT algorithm to the trivial operation of convolution in the frequency domain, which is simply multiplication. The more parts we split the impulse response into, the more convolution and less FFT is done. Naturally the FFTs get shorter, and thus we get rid of the problems associated with long FFTs. We now realize that partitioned convolution is the answer to our wishes: we do not need long FFTs, and it becomes less important to optimize the FFT algorithm.
We notice that we will gain the most from optimizing the operation where a segment of input converted to the frequency domain is multiplied with the corresponding part of the filter, also in the frequency domain. The result is then added to the output. When the data format is half-complex, a format used by most real-valued FFTs, the straightforward implementation looks like this when programmed in C:
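    d[0] += b[0] * c[0];                    /* first element is purely real */
    for (n = 1; n < n_fft / 2; n++) {
        /* real part */
        d[n] += b[n] * c[n] - b[n_fft - n] * c[n_fft - n];
        /* imaginary part */
        d[n_fft - n] += b[n] * c[n_fft - n] + b[n_fft - n] * c[n];
    }
    d[n] += b[n] * c[n];                    /* n == n_fft / 2, purely real */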
b is the input, c is the filter coefficients, and d is the output. As we see, this is a very short and simple
algorithm, which is easy to implement in assembler. There are a couple
of problems though. The data in each array is accessed from the tail
and the front at the same time. It would be better for the cache to
localize the accesses, and move from front to end only. It is also a
problem that the data is accessed both in forward and reverse order
(both 0,1,2,3 and 3,2,1,0), since we want to use SIMD
instructions. To solve the problem, we need to reorder the data. This
will only be necessary to do once with the filter coefficients, so it is
free. For the input however, we need to do this once after each forward
transform, and for the output we need to restore the half-complex
order prior to each inverse transform. In BruteFIR the input reordering
is put into the mixing and scaling step, and the output reordering in
the quantization step, so the cost is next to nothing. Below is a
C implementation of the previous algorithm, when data has been
reordered to better fit SIMD instructions and to improve the memory
access pattern:
    /* elements 0 and 4 are plain real products; compute them here
       and restore them after the loop */
    d1s = d[0] + b[0] * c[0];
    d2s = d[4] + b[4] * c[4];
    for (n = 0; n < n_fft; n += 8) {
        d[n+0] += b[n+0] * c[n+0] - b[n+4] * c[n+4];
        d[n+1] += b[n+1] * c[n+1] - b[n+5] * c[n+5];
        d[n+2] += b[n+2] * c[n+2] - b[n+6] * c[n+6];
        d[n+3] += b[n+3] * c[n+3] - b[n+7] * c[n+7];
        d[n+4] += b[n+0] * c[n+4] + b[n+4] * c[n+0];
        d[n+5] += b[n+1] * c[n+5] + b[n+5] * c[n+1];
        d[n+6] += b[n+2] * c[n+6] + b[n+6] * c[n+2];
        d[n+7] += b[n+3] * c[n+7] + b[n+7] * c[n+3];
    }
    d[0] = d1s;
    d[4] = d2s;
The above function is easily converted into assembler using Intel's SSE instructions, or AMD's 3Dnow instructions, with cache hint instructions. The key loop (which is unrolled to further improve performance) becomes less than 50 lines long.
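The reordering itself is not spelled out in the text. The sketch below shows one layout that is consistent with the two C loops: groups of four real parts followed by the four matching imaginary parts, with slot 4 holding the real-only value at n_fft / 2. This layout is inferred from the loops and is an assumption, not BruteFIR's actual code; the function names are hypothetical, and n_fft is assumed to be a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply-accumulate on half-complex data (the straightforward loop). */
static void mac_halfcomplex(const float *b, const float *c, float *d,
                            size_t n_fft)
{
    size_t n;
    d[0] += b[0] * c[0];
    for (n = 1; n < n_fft / 2; n++) {
        d[n] += b[n] * c[n] - b[n_fft - n] * c[n_fft - n];
        d[n_fft - n] += b[n] * c[n_fft - n] + b[n_fft - n] * c[n];
    }
    d[n] += b[n] * c[n];
}

/* Multiply-accumulate on reordered data (the SIMD-friendly loop). */
static void mac_reordered(const float *b, const float *c, float *d,
                          size_t n_fft)
{
    size_t n, i;
    float d1s = d[0] + b[0] * c[0];
    float d2s = d[4] + b[4] * c[4];
    for (n = 0; n < n_fft; n += 8) {
        for (i = 0; i < 4; i++) {
            d[n + i] += b[n + i] * c[n + i] - b[n + 4 + i] * c[n + 4 + i];
            d[n + 4 + i] += b[n + i] * c[n + 4 + i] + b[n + 4 + i] * c[n + i];
        }
    }
    d[0] = d1s;
    d[4] = d2s;
}

/* ASSUMED layout: the real part of bin k goes to (k / 4) * 8 + k % 4 and
 * its imaginary part four slots later. Bin 0 has no imaginary part, so
 * slot 4 holds the real-only value stored at index n_fft / 2. */
static void reorder(const float *hc, float *out, size_t n_fft)
{
    size_t k;
    assert(n_fft >= 8 && n_fft % 8 == 0);
    for (k = 0; k < n_fft / 2; k++) {
        size_t re = (k / 4) * 8 + k % 4;
        out[re] = hc[k];
        out[re + 4] = (k == 0) ? hc[n_fft / 2] : hc[n_fft - k];
    }
}

/* Inverse of reorder(): back to half-complex order. */
static void restore(const float *in, float *hc, size_t n_fft)
{
    size_t k;
    assert(n_fft >= 8 && n_fft % 8 == 0);
    for (k = 0; k < n_fft / 2; k++) {
        size_t re = (k / 4) * 8 + k % 4;
        hc[k] = in[re];
        if (k == 0)
            hc[n_fft / 2] = in[re + 4];
        else
            hc[n_fft - k] = in[re + 4];
    }
}
```

Reordering the input, filter and output buffers, running mac_reordered, and restoring the output should give the same values as running mac_halfcomplex on the original buffers. Note that within each group of eight, every access moves forward through memory in runs of four, which is what makes the loop map onto four-wide SIMD registers.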
It is interesting that partitioned convolution makes many more memory references than ordinary overlap-save. In the simplest algorithm analysis, only the number of mathematical operations (like multiplications and additions) is considered when evaluating performance. Better analysis also counts the number of memory references, but unfortunately that is not enough considering the modern computer architecture; it is also of profound importance to take into consideration how the accesses are done. One bad reference can be worse in terms of performance than ten good ones on a modern computer.
By implementing partitioned convolution we have avoided the need for long FFTs, and moved the major part of the processing time from the FFT to a simple multiplication loop. By reordering data after the forward transform and restoring it prior to the inverse transform, the multiplication loop can be easily realized with SIMD instructions, and thus become very efficient. On the 900 MHz AMD Athlon test system, filtering of a 131072 tap long filter is twice as fast when 16 partitions of 8192 taps each are used instead of a single partition (note: this test case is exceptional, the performance improvement is less in the common case). This is despite the fact that the new algorithm uses more memory references and more mathematical operations.
Apart from the improvement in throughput, we also get lower I/O-delay (equals about twice the partition length), lower memory consumption, and more flexible filter length options. A 140000 tap filter would require a 262144 tap filter if ordinary overlap-save was used, but with partitioned convolution we can use 18 partitions of 8192 taps, and then get a gross performance improvement, coupled with delay reduction.
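The filter length arithmetic above can be reproduced with a short sketch. It assumes that a single-partition filter has to be rounded up to a power of two, which is what the 262144 figure suggests; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of partitions of part_len taps needed to hold filter_len taps. */
static size_t partitions_needed(size_t filter_len, size_t part_len)
{
    assert(part_len > 0);
    return (filter_len + part_len - 1) / part_len;   /* round up */
}

/* Smallest power of two that is >= n, for n >= 1.
 * ASSUMPTION: a single-partition filter length must be a power of two. */
static size_t next_pow2(size_t n)
{
    size_t p = 1;
    assert(n >= 1);
    while (p < n)
        p *= 2;
    return p;
}
```

For a 140000 tap filter, partitions_needed(140000, 8192) gives 18 partitions (147456 taps in total), while next_pow2(140000) gives 262144, so the partitioned filter carries far fewer padding taps.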
Still, one must not over-estimate partitioned convolution. If there really is an optimal FFT algorithm available, ordinary overlap-save will certainly outperform the partitioned algorithm. An example of an assembler-optimized FFT algorithm can be found in the non-free and non-portable Intel Native signaling processing library [19].
You are free to download version 1.0o.
The package contains the source code. You will need a supported platform to run it on (Linux is recommended; FreeBSD and Solaris should work out of the box too, although they are not as closely maintained). Apart from the basic stuff you must also have FFTW3 installed (note that FFTW2, as used by old versions of BruteFIR, won't work). FFTW3 must be compiled for both double and single precision.
If you want sound card support, it is recommended to use ALSA on Linux platforms, and when that is not available, OSS can be used.
If you want to use the JACK support, you need an up to date version of JACK installed.
Be sure that you use an official GCC compiler when compiling BruteFIR. One user reported bad sound quality (noise artifacts in the BruteFIR output), and it was shown that he had used GCC 2.96 (not an official version), that caused errors in the floating point calculations of BruteFIR.
The package does not yet contain configure scripts or other nice things to make compiling easier. However, with some luck it should work simply by typing 'make'. You can also view the Makefile to see what compile options there are. If you have any questions, just mail me (address in the footer).
BruteFIR's main feature is that it is fast. It's brutally fast. The key component making BruteFIR fast is the convolution algorithm described above.
Note: the test descriptions here are a bit dated, made using an old version of BruteFIR. However, the results should provide a rough idea of what BruteFIR can do in terms of throughput. The example configuration files have been updated to work with the current version.
With a massive convolution configuration file setting up BruteFIR to run 26 filters, each 131072 taps long, each connected to its own input and output (that is 26 inputs and outputs), meaning a total of 3407872 filter taps, a 1 GHz AMD Athlon with 266 MHz DDR RAM gets about 90% processor load, and can successfully run it in real time. The sample rate was 44.1 kHz, BruteFIR was compiled with 32 bit floating point precision, and the I/O delay was set to 375 ms. The sound card used was an RME Audio Hammerfall.
BruteFIR is mainly designed for high throughput, not low delay. However, there is an interest in using BruteFIR for low delay convolution anyway, so here are some benchmarks so you know what to expect. Partitioned convolution can indeed allow for quite low delay, very low if the processing power is available and the filters are not too long.
Below is an example of a simple cross-talk cancellation application running on a 1 GHz AMD Athlon with 266 MHz DDR RAM and an RME Audio Hammerfall sound card. You can download the cross-talk cancellation configuration file that was used if you want to test yourself. There are only four filters and their lengths are no more than 8192 taps (note: the example files included in the package are only 4096 taps long, as seen in the updated example configuration file), so it is indeed a very light application, which is a requirement if you want very low delay, since partitioned convolution does not scale very well with low delays (meaning a large number of partitions). The sample rate in these tests is 44.1 kHz, and BruteFIR was running with 32 bit floating point precision.
I/O delay | processor load | partition size | number of partitions |
---|---|---|---|
3 ms | 60% | 64 samples | 128 |
6 ms | 30% | 128 samples | 64 |
12 ms | 16% | 256 samples | 32 |
24 ms | 11% | 512 samples | 16 |
47 ms | 8% | 1024 samples | 8 |
As seen in the table, BruteFIR allows for as low delay as 3 milliseconds, which is the limit of the sound card used, which cannot have shorter than 64 sample partitions.
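The delay figures in the table follow from the partition size. Here is a minimal sketch, assuming the I/O delay is twice the partition length as stated earlier and that the result is rounded up to whole milliseconds; the rounding is an assumption chosen because it reproduces the table. The function names are hypothetical.

```c
#include <assert.h>

/* I/O delay in whole milliseconds, rounded up, assuming the delay
 * equals twice the partition size. */
static unsigned long io_delay_ms(unsigned long partition_size,
                                 unsigned long sample_rate)
{
    unsigned long samples = 2UL * partition_size;
    assert(sample_rate > 0);
    return (samples * 1000UL + sample_rate - 1UL) / sample_rate;
}

/* Number of partitions for a filter of filter_len taps, rounded up. */
static unsigned long partition_count(unsigned long filter_len,
                                     unsigned long partition_size)
{
    assert(partition_size > 0);
    return (filter_len + partition_size - 1UL) / partition_size;
}
```

At 44.1 kHz, a 64 sample partition gives 128 samples of delay, or about 2.9 ms, and an 8192 tap filter then needs 128 partitions, matching the first row of the table.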
If you want to run BruteFIR to achieve high throughput, you should expect to have a delay of at least 100 ms though (and using no more than 16 partitions or so).
If you try to run BruteFIR with shorter delay than the computer can handle, or with too long filters, the program will exit with a broken pipe signal. If you get broken pipe only after a while, this is probably because you have not applied a good low latency patch to the kernel (there are bad ones as well), or because you have cron jobs or other software running that competes for the processor. For reasonably low latency, a low latency kernel can handle other processes running, but for as low as 3 milliseconds like in this example, you should have a dedicated clean system for running BruteFIR.
Note: the hardware referenced here is a bit dated (the text was written a long time ago), but apart from that, the text is up to date.
What is important for BruteFIR is that the machine has fast memory and a fast processor. A Pentium 4 with its RDRAM is probably the best choice today. However, an Athlon with DDR RAM is not bad either, and significantly cheaper. A fast processor on a computer with slow memory is what most often causes disappointment. For example, a dual Pentium III at 1 GHz with good use of both processors was found to be slower than a single processor 1 GHz AMD Athlon with DDR RAM. The problem was that the Pentium III had poor memory performance. The stream benchmark [20] is a good program to use to verify the memory bandwidth if you think you get poor BruteFIR performance.
If you use SDRAM you will never get exceptional memory bandwidth, however, some tuning of timer settings in the BIOS, or overclocking of the memory bus can give you quite decent performance.
When it comes to sound hardware, you should be able to use any card that is compatible with ALSA [2]. However, it is not very likely that the sound card code of BruteFIR will work for all sound cards supported by ALSA, although that is the goal. If you get problems with your sound card, please send me a mail, and I will do my best to get it to work, or even better, try to get it to work yourself and send me a patch.
The best sound cards are those which support partition sizes which are powers of two. If that is not the case, BruteFIR must run in input poll mode, which is not necessarily less reliable, but will consume a part of the spare processor time.
The worst possible sound card is one which does not support partition sizes with a power of two, and can only transfer large sample blocks at a time. Then BruteFIR will run unreliably or not at all.
If you want to avoid problems I recommend RME Audio [21] Hammerfall (Light) (RME9652 and RME9636) and also cards from the RME Audio Digi96 series (RME96), since those are the cards I use myself. The Hammerfall cards support up to 26 inputs and 26 outputs, the Digi96 cards support up to 8 channels. They are not the cheapest cards out there, but these are clean professional cards, fully digital with ADAT and S/PDIF inputs and outputs, which means you can have high-quality DACs and ADCs outside the computer to get the best sonic performance possible.
The Hammerfall cards allow for shorter delay (minimum partition size is 64 samples) than the Digi96 series (minimum size 1024 samples).
When BruteFIR is run for the first time (without parameters), it will generate a default configuration file (~/.brutefir_defaults), unless the -nodefault option is used, and then complain that it cannot find .brutefir_config in the home directory, which is the default location. The default configuration file contains default settings, which are extended and/or overridden in the main configuration file. A setting that is specified in the default configuration file does not need to be listed in the main configuration file.
BruteFIR takes only four parameters, namely the filename of the main configuration file, and optionally -quiet to suppress title, warnings and informational messages at startup, -nodefault if BruteFIR should read all settings from the main configuration file, and finally -daemon if it should run as a daemon.
If no parameters are given, the filename given in the default configuration file is used. If the filename is "stdin", BruteFIR will expect the configuration file to be available on the standard input.
The (default) default configuration file looks like this:

    ## DEFAULT GENERAL SETTINGS ##

    float_bits: 32;             # internal floating point precision
    sampling_rate: 44100;       # sampling rate in Hz of audio interfaces
    filter_length: 65536;       # length of filters
    config_file: "~/.brutefir_config"; # standard location of main config file
    overflow_warnings: true;    # echo warnings to stderr if overflow occurs
    show_progress: true;        # echo filtering progress to stderr
    max_dither_table_size: 0;   # maximum size in bytes of precalculated dither
    allow_poll_mode: false;     # allow use of input poll mode
    modules_path: ".";          # extra path where to find BruteFIR modules
    powersave: false;           # pause filtering when input is zero
    monitor_rate: false;        # monitor sample rate
    lock_memory: true;          # try to lock memory if realtime prio is set
    sdf_length: -1;             # subsample filter half length in samples
    convolver_config: "~/.brutefir_convolver"; # location of convolver config file

    ## COEFF DEFAULTS ##

    coeff {
        format: "text";     # file format
        attenuation: 0.0;   # attenuation in dB
        blocks: -1;         # how long in blocks
        skip: 0;            # how many bytes to skip
        shared_mem: false;  # allocate in shared memory
    };

    ## INPUT DEFAULTS ##

    input {
        device: "file" {};  # module and parameters to get audio
        sample: "S16_LE";   # sample format
        channels: 2/0,1;    # number of open channels / which to use
        delay: 0,0;         # delay in samples for each channel
        maxdelay: -1;       # max delay for variable delays
        mute: false, false; # mute active on startup for each channel
    };

    ## OUTPUT DEFAULTS ##

    output {
        device: "file" {};  # module and parameters to put audio
        sample: "S16_LE";   # sample format
        channels: 2/0,1;    # number of open channels / which to use
        delay: 0,0;         # delay in samples for each channel
        maxdelay: -1;       # max delay for variable delays
        mute: false, false; # mute active on startup for each channel
        dither: false;      # apply dither
        merge: false;       # merge discontinuities at coeff change
    };

    ## FILTER DEFAULTS ##

    filter {
        process: -1;        # process index to run in (-1 means auto)
        delay: 0;           # predelay, in blocks
        crossfade: false;   # crossfade when coefficient is changed
    };

The syntax of the main configuration file is very similar, as we will see. As we can see, there are five sections in the configuration: general settings, coeff, input, output and filter.
The general syntax rules for the configuration files are easily grasped from the default configuration file. The semicolons are important: they, not line breaks, mark the end of a setting, so you may have several settings on one line if you like. All characters on a line after a # are ignored. There are three data types: strings, numbers and booleans. Strings are text between quotes, a number is written either with or without a decimal dot, and a boolean is either 'true' or 'false'.
Note that everything is case sensitive, so setting names must be written with small letters. Although the configuration file examples shown here are nicely ordered in sections, it is perfectly alright to mix settings in any order you like.
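To make the syntax rules concrete, here is a small Python sketch that reads flat "name: value;" settings the way the rules above describe them. It is not BruteFIR's parser and all function names are made up for the illustration. It deliberately ignores structures and comma-separated lists, and it assumes that no string value contains a semicolon.

```python
def strip_comment(line):
    """Drop everything from the first '#' that is not inside quotes."""
    in_string = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_string = not in_string
        elif ch == "#" and not in_string:
            return line[:i]
    return line


def parse_value(text):
    """Map a setting value to a Python str, bool, int or float."""
    text = text.strip()
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        return text[1:-1]
    if text in ("true", "false"):
        return text == "true"
    return float(text) if "." in text else int(text)


def parse_settings(source):
    """Parse flat 'name: value;' settings; line breaks carry no meaning."""
    body = " ".join(strip_comment(line) for line in source.splitlines())
    settings = {}
    for item in body.split(";"):
        if item.strip():
            name, value = item.split(":", 1)
            settings[name.strip()] = parse_value(value)
    return settings


example = 'float_bits: 32; sampling_rate: 44100; # two settings on one line\nmodules_path: ".";\npowersave: false;'
print(parse_settings(example))
```

Because settings end at the semicolon, the two settings on the first line of the example are read exactly like the ones that have a line of their own.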
The general settings section in the main configuration file has the same syntax as in the default configuration file. The difference is that coeff, input, output and filter structures can exist in multiples, and are given names and more parameters.
Default values of all general settings (except logic) must be given in the default configuration file. Any of these settings may be overridden in the main configuration file (except config_file). These settings are:

    float_bits: <NUMBER: internal floating point resolution, either 32 or 64>;
    sampling_rate: <NUMBER: sampling rate in Hz>;
    filter_length: <NUMBER: length in samples of the (sub)filters>[,<NUMBER: number of subfilters per filter>];
    config_file: <STRING: default location of main configuration file>;
    overflow_warnings: <BOOLEAN: echo overflow warnings to stderr>;
    show_progress: <BOOLEAN: echo progress to stderr>;
    max_dither_table_size: <NUMBER: maximum size in bytes of pre-calculated dither>;
    allow_poll_mode: <BOOLEAN: allow input poll mode>;
    modules_path: <STRING: extra path where to find BruteFIR modules>;
    logic: <STRING: logic module name> { <logic module parameters> }[, ...];
    powersave: <BOOLEAN or NUMBER: pause filtering when input is zero>;
    monitor_rate: <BOOLEAN: monitor sample rate, and abort if it changes>;
    lock_memory: <BOOLEAN: try to lock memory if realtime prio is set>;
    sdf_length: <NUMBER: sub-sample delay filter half length in samples>[, <NUMBER: kaiser window beta>];
    convolver_config: <STRING: file to store FFTW wisdom in>;
    benchmark: <BOOLEAN: start in benchmark mode (can only be used in main config file)>;
    safety_limit: <NUMBER: if non-zero max dB in output before aborting>;

The filter_length setting specifies how long the filters should be. This can be done in two ways. The first is to specify the length as a single number, which must be a power of two; the convolution is then done on the whole filter length. The second is to specify a partition length followed by the number of partitions: to partition a 65536 tap filter in 16 parts, you write filter_length: 4096,16. Partitioned filters can be used to improve performance and reduce I/O-delay.
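As a quick check of the arithmetic, the following Python sketch computes the total filter length for both forms of the setting. The helper names are hypothetical, and applying the power-of-two check to the partition length as well is an assumption on my part; the text only states it for the single-number form.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and n & (n - 1) == 0


def total_filter_taps(setting):
    """Total taps for a filter_length value such as '65536' or '4096,16'."""
    parts = [int(p) for p in setting.split(",")]
    length = parts[0]
    partitions = parts[1] if len(parts) > 1 else 1
    if not is_power_of_two(length):
        # Required by the text for the single-number form; assumed here
        # to hold for the partition length too.
        raise ValueError("length must be a power of two")
    return length * partitions


print(total_filter_taps("65536"))    # unpartitioned
print(total_filter_taps("4096,16"))  # same total length, 16 partitions
```

Both calls describe a filter of the same total length; only the second one is partitioned.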
The convolver_config setting specifies where FFTW wisdom should be stored, that is, optimization information for the FFT calculations.
If overflow_warnings is set to true, information about overflows will be printed to the screen when they occur. Note that overflowed samples are always set to the maximum output value of the output device, so there is no actual overflow on the output (unless the actual floating point value is overflowed). If overflow occurs, it means that the filter is amplifying too much, either through its coefficients or through input and output attenuation. Overflow is not checked for if the output values are floating point.
If dither is applied to any output, a dither table will be calculated when the program is started. It contains uncorrelated random values that are used to generate the dither. The more channels that apply dither, the larger the table needed to keep the dither uncorrelated between channels. This table can get quite large memory-wise. If you want to limit its size, set max_dither_table_size to a value. It should preferably not be less than one megabyte though. If it is set to zero or negative, the program will itself choose a size.
BruteFIR uses external modules to provide sample I/O, and optionally add new logic. It will search a few default directories to find any modules that should be loaded, as specified in the configuration. The setting modules_path will add an extra directory, which is searched first. The value in the created default configuration file will be ".", that is the current working directory.
If any logic modules should be loaded, these are listed in the logic field, in pairs of module name / module parameters, separated with commas. The available logic modules and the functionality they provide are described in the Logic modules section.
If there is any sound card used for input or output (or any other sample-clock dependent device), BruteFIR will automatically set its delay-sensitive processes to realtime priority, thus you will typically need to run the program as root. To maintain realtime performance, it is important that there is no memory belonging to the program in the swapfile, thus all memory must be locked to RAM. This is done if lock_memory is set to true. Note that the memory is never locked when realtime priority is not set (that is when there are only files used for input and output). Warning: there seems to be a bug in the Linux kernel which causes the shared memory to be locked once for each process, meaning that when lock_memory is set to true, BruteFIR will seem to consume a lot more memory than it should. Also, it of course makes no sense to lock memory if your system does not have swap activated. Due to this issue, the best thing to do is to have a system with no swap and avoid locking the memory.
The powersave feature, if activated, will monitor the inputs, and if an input channel provides zero samples, the associated filters will not do any processing, since with zero on the input, BruteFIR knows in advance that there will be zero on the output. BruteFIR will continue to run as normal, and filters with non-zero inputs will continue to process normally. As soon as there is non-zero input on a suspended filter, it starts processing again. This powersave feature is transparent: there will be no convolution errors if it is activated. The reason for having it optional is that one may want to make performance tests, without the need to feed a meaningful signal to BruteFIR.
If analog inputs are used, the input will never be exactly zero, and thus the powersave feature will not be triggered. However, if a value is specified instead of the boolean (for example powersave: -80;), that value is interpreted as the lowest level in dB the input signal can be, before BruteFIR will consider the input as zero, and trigger powersave. Thus, a noise floor can be specified, and then powersave can work together with analog inputs.
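To get a feeling for what a noise floor such as -80 means in terms of sample values, here is a minimal Python sketch. It assumes that the level is an amplitude ratio in dB relative to a full scale of 1.0 and that a block counts as silent when every sample is at or below the threshold. Both are assumptions of mine, and the function names are hypothetical; the text does not describe how BruteFIR makes the comparison.

```python
def db_to_amplitude(level_db):
    """Convert a level in dB (relative to full scale 1.0) to an amplitude."""
    return 10.0 ** (level_db / 20.0)


def is_silent(block, floor_db):
    """True if no sample in the block rises above the noise floor."""
    threshold = db_to_amplitude(floor_db)
    return all(abs(sample) <= threshold for sample in block)


print(db_to_amplitude(-80))                 # about 0.0001 of full scale
print(is_silent([0.00005, -0.00002], -80))  # noise only
print(is_silent([0.00005, 0.01], -80))      # contains signal
```

With 16 bit samples, -80 dB corresponds to roughly three quantization steps under these assumptions, so the noise of a decent analog input should stay below it.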
If benchmark mode is activated (can only be done in the main configuration file), performance statistics will be printed on screen. Note that due to complex caching effects of modern computers, the displayed processing times can look strange: a step that requires many more arithmetic operations than another may in certain circumstances still be considerably faster, if it has better luck with the cache. Since benchmarking measures elapsed time, the computer must not be loaded with any other tasks in order to get reliable results.
If a sound card which is used for input cannot be configured to have a period size (interrupt interval) equal to or smaller than the configured filter (partition) length, or if it cannot be a power of two, BruteFIR must be run in input poll mode. This means that the sound card is polled for data, and sound card interrupts are not used. BruteFIR will run just as reliably (as long as the sound card allows for small transfers) but will consume more of the spare processor time. Thus it will look like BruteFIR uses more processor than it actually needs to. If more processor time is used for filtering, less will be used for polling, thus input poll mode does not mean that it is not possible to have as long filters as when running in normal mode. However, for some applications (for example when the spare processor time is used by another vital program), input poll mode is not suitable, and by setting allow_poll_mode to false, BruteFIR will exit with an error if input poll mode is required.
If subsample delays should be possible to set, the sdf_length setting must be larger than zero. It specifies the half length of a sub-sample delay filter. A sub-sample delay filter is simply a sinc sampled with a sub-sample offset. Thus, when a signal is convolved with the filter it is delayed with the corresponding offset. Since a sinc signal is infinitely long, it must be windowed. A kaiser window is used, default beta is 9.0, but a custom value can be specified by adding it after a comma (example: sdf_length: 31, 8.5;). There is little reason to use other than the default though. The distortion caused by the windowing is a soft rolloff at higher frequencies, the shape depends on the beta value. There is no phase distortion. Since the sub-sample filters are linear phase, they will add a pre-response (in practice I/O-delay), which is their half filter length, that is the value given after the sdf_length setting. If sub-sample delays are used only on inputs or outputs, the added pre-response is the same as the sdf_length, if used on both (usually not necessary), it will be twice the length. To activate sub-sample delay, a valid subdelay must also be specified in at least one of the input/output structures. The valid range is -99 to 99.
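For the curious, a sub-sample delay filter of this kind can be sketched in a few lines of Python. This is an illustration of the general technique (a sinc sampled with an offset, multiplied by a kaiser window), not BruteFIR's actual code. In particular, centering the window on the middle tap rather than on the shifted sinc is an assumption, and the function names are made up.

```python
import math


def bessel_i0(x):
    """Modified Bessel function of the first kind, order 0 (power series)."""
    total, term, k = 1.0, 1.0, 0
    while term > 1e-12 * total:
        k += 1
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def sinc(x):
    """Normalized sinc, sin(pi x) / (pi x)."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def subsample_delay_filter(sdf_length, subdelay, beta=9.0):
    """Kaiser-windowed sinc that delays by subdelay/100 samples.

    Returns 2 * sdf_length + 1 taps; the middle tap has index sdf_length,
    which is the pre-response (I/O delay) the filter adds.
    """
    offset = subdelay / 100.0
    taps = []
    for n in range(-sdf_length, sdf_length + 1):
        ratio = n / sdf_length
        window = bessel_i0(beta * math.sqrt(1.0 - ratio * ratio)) / bessel_i0(beta)
        taps.append(sinc(n - offset) * window)
    return taps


half_sample = subsample_delay_filter(31, 50)   # delay of 0.50 samples
print(len(half_sample), round(sum(half_sample), 4))
```

With a sub-delay of zero the sinc is sampled at its zero crossings and the filter collapses to a single unit tap in the middle, so only the pre-response remains.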
The advantage of a long sub-sample filter length is that the rolloff in the high frequencies starts later and gets sharper, that is less high frequency information is lost. The disadvantage of long sub-sample filters is that the required CPU time increases, and the added I/O-delay increases. Sub-sample filters are processed separately in the frequency domain using FFT, and therefore it is recommended to keep sdf_length at a power of two minus one (the actual filter length is twice sdf_length plus one), which means that as much as possible of the FFT block is used (an sdf_length of 16 requires as much CPU time as an sdf_length of 31, since the same block length is required). With an sdf_length of 31 and the default beta of 9.0, and a sample rate of 44100 Hz, the response is flat up to 19 kHz, and then a soft rolloff begins which reaches -0.20 dB at 20 kHz, which is good enough for most needs. The next natural step, 63, keeps a flat response up to about 20500 Hz, with -0.20 dB at 21 kHz.
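The "power of two minus one" advice can be illustrated with a small calculation. The Python sketch below assumes that the block is the smallest power of two that holds the 2 * sdf_length + 1 filter taps; the real block size inside BruteFIR may be a multiple of that, but the steps fall at the same sdf_length values either way. The function name is hypothetical.

```python
def fft_block_size(sdf_length):
    """Smallest power of two that holds the 2 * sdf_length + 1 filter taps."""
    taps = 2 * sdf_length + 1
    size = 1
    while size < taps:
        size *= 2
    return size


for half_length in (15, 16, 31, 32, 63):
    print(half_length, 2 * half_length + 1, fft_block_size(half_length))
```

An sdf_length of 16 and one of 31 land in the same block, while 32 spills over into the next block size, which is why 31 and 63 are the natural steps.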
The purpose of the safety_limit setting is to protect your ears and expensive speakers. It is active if set to a non-zero value. Every output sample is checked, and if it exceeds this value (in dB) BruteFIR will immediately exit with an error message, before any sound is sent to the output.

    <structure type name> <STRING: name (list for some) | NUMBER: index> {
        <field name 1>: <setting 1>;
        [...]
    };

Names of structures (given after the type name) are not given in the default configuration file, but must be provided in the main configuration file. The name is either a custom string, or an index number, which must then be the same as the order of the structure in the file, that is the first structure must be indexed 0, the second 1 and so on. If a string name is given, the index number is given automatically (the opposite also applies), and when referring to the structure, either the string name or the index number can be used. Some structures, namely input and output, may have a comma-separated list of names, since the names apply to the channels defined in the structure.
After the name, or the structure type name if in the default configuration file, there is a left brace ({), and then structure fields and their settings, each field/setting pair ending with a semicolon (;). As for the general settings, field names always end with a colon (:). The order of the fields is not important. The structure is closed with a right brace (}) and ended with a semicolon.
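The way a structure can be referred to either by name or by index can be sketched like this in Python. The function names are made up and this is not how BruteFIR stores its structures; it only illustrates that the index is simply the order of appearance in the file.

```python
def build_index(names):
    """Map each structure name to its index, in order of appearance."""
    return {name: i for i, name in enumerate(names)}


def resolve(reference, names):
    """Accept either a string name or an index number; return the index."""
    if isinstance(reference, int):
        if not 0 <= reference < len(names):
            raise IndexError(reference)
        return reference
    return build_index(names)[reference]


coeff_names = ["direct path", "cross path"]  # order of appearance in the file
print(resolve("cross path", coeff_names))    # by name
print(resolve(1, coeff_names))               # by index, same structure
```

So a filter that says coeff: 1 and one that says coeff: "cross path" would use the same coefficient set when the coeff structures appear in this order.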

    coeff <STRING: name | NUMBER: index> {
        filename: <STRING: filename>; | <NUMBER: shmid>/<NUMBER: offset>/<NUMBER: blocks>[,...];
        format: <STRING: sample format string | "text" | "processed">;
        attenuation: <NUMBER: attenuation in dB>;
        blocks: <NUMBER: length in blocks>;
        skip: <NUMBER: bytes to skip in beginning of file>;
        shared_mem: <BOOLEAN: allocate in shared mem>;
    };

In the default configuration file, the filename field is not set, so it must be present in the main configuration file.
The coeff structure defines a set of filter coefficients, which becomes a FIR filter. There are several different file formats:
"text": coefficients are listed in a text file, one coefficient per line. They are parsed with the standard C library strtod function.
"processed": coefficients are stored in the format BruteFIR uses internally. Attenuation or adapted length cannot be applied if this format is used.
Note that BruteFIR currently does not provide any way to convert other formats to the "processed" format (well actually it does, but only through its module API).
The coefficients can be scaled, by setting the attenuation to non-zero.
Instead of a filename, comma-separated number groups can be given. The first number is a shared memory ID (man shmat) where the data is found, the second number is the offset in bytes into the shared memory area where the program starts to read, and the third is how many blocks should be read. A block is a filter segment, that is, if filter_length is 4096, 16 one block is 4096 coefficients, and there can be no more than 16 blocks per coefficient set. If not all blocks are covered in the first group, there must be following number groups to provide the full length. When a shared memory segment is given, it is required that the format is "processed".
In some cases, when one wants to test the performance of a certain BruteFIR configuration, but does not feel like generating coefficients, one can set the filename to "dirac pulse". Then BruteFIR will generate a dirac pulse filter internally and use it as any other filter, and thus it will cost as much in processing as any other filter of the same length. However, if you need a dirac pulse in the real case, it makes no sense using this feature, since simply setting the coeff field in the filter structure to -1 gives the same effect and uses very little processor power (and memory).
The blocks field says how long in filter blocks the coefficient set should be. If it is set to -1, the full length is assumed. Note that custom lengths are only possible if partitioned convolution is employed (quite naturally, since else there will only be one filter block covering the full length).
The skip field, if given, specifies how many bytes in the beginning of the file should be skipped. This can be used to skip headers in a file or similar. The field will be ignored if the coefficients are not read from file.
The shared_mem field indicates if the coefficient should be stored in shared memory. Some modules may require that, such as the equalization module.

    input <STRING: name | NUMBER: index>[, ...] {
        device: <STRING: I/O module name> { <I/O module settings> };
        sample: <STRING: sample format>;
        channels: <NUMBER: open channels>[/<NUMBER: channel index>[, ...]];
        delay: <NUMBER: delay in samples>[, ...];
        subdelay: <NUMBER: additional delay in 1/100th samples (valid range -99 - 99)>[, ...];
        maxdelay: <NUMBER: maximum delay for dynamic changes>;
        individual_maxdelay: <NUMBER: maximum delay for dynamic changes>[, ...];
        mute: <BOOLEAN: mute channel>[, ...];
        mapping: <NUMBER: channel index>[, ...];
    };

    output <STRING: name | NUMBER: index>[, ...] {
        device: <same syntax as for the input structure>;
        sample: <same syntax as for the input structure>;
        channels: <same syntax as for the input structure>;
        delay: <same syntax as for the input structure>;
        subdelay: <NUMBER: additional delay in 1/100th samples (valid range -99 - 99)>[, ...];
        maxdelay: <same syntax as for the input structure>;
        individual_maxdelay: <same syntax as for the input structure>;
        mute: <same syntax as for the input structure>;
        mapping: <same syntax as for the input structure>;
        dither: <BOOLEAN: apply dither>;
        merge: <BOOLEAN: merge discontinuities at coeff change>;
    };

All fields for the input and output structures except mapping, delay and mute must be set in the default configuration file.
The device field specifies the source/destination of the digital audio. This is always an I/O module. First the name of the module is stated, followed by its configuration within {}. If the audio is read/written from/to a module which does not continue forever (for example reading from a file), BruteFIR will finish when the first I/O module comes to an end (hopefully an input module, write failure of an output module is considered an error).
The sample format should be one of the following strings:
The common format 16 bit signed little endian found in for example 16 bit wav-files is thus "S16_LE". The floating point formats can be in any range, however all integer formats will be scaled to -1.0 to +1.0 internally, so to match an integer format, the range should be -1.0 to +1.0. There is no overflow checking for floating point formats (that is, values larger than +1.0 or less than -1.0 are not truncated).
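As an illustration of the scaling of integer formats, here is a Python sketch that decodes raw "S16_LE" data to floating point values. Dividing by 32768 is an assumption about the exact scale factor (it maps the most negative sample to exactly -1.0); the text only says that integer formats end up in the range -1.0 to +1.0. The function name is made up.

```python
import struct


def decode_s16_le(data):
    """Decode raw 16 bit signed little endian samples to floats."""
    count = len(data) // 2
    values = struct.unpack("<%dh" % count, data[:2 * count])
    # Assumption: full scale is 32768, so -32768 maps to exactly -1.0.
    return [v / 32768.0 for v in values]


raw = b"\x00\x00\xff\x7f\x00\x80"   # 0, +32767 and -32768
print(decode_s16_le(raw))
```

A floating point input that is meant to be mixed with such a channel should therefore also keep its samples within -1.0 to +1.0.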
The channels field specifies the number of open and used channels of the device. If the number of open channels exceeds the number of used channels, a slash (/) followed by a comma-separated list of channel indexes of used channels must be appended. If we for example have an eight channel ADAT sound card, but we only want to use the first two, we write 8/0,1 as the channels setting. As you see, the lowest channel index is zero, not one.
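The channels notation is easy to take apart programmatically. The Python sketch below shows one way to read it; the function name is hypothetical, and treating a missing list as "all open channels, in order" is my reading of the text rather than something it states outright.

```python
def parse_channels(setting):
    """Return (open_channels, used_channel_indexes) for e.g. '8/0,1'."""
    head, _, tail = setting.partition("/")
    open_channels = int(head)
    if tail:
        used = [int(i) for i in tail.split(",")]
    else:
        # Assumption: without a list, all open channels are used in order.
        used = list(range(open_channels))
    if any(not 0 <= i < open_channels for i in used):
        raise ValueError("channel index out of range")
    return open_channels, used


print(parse_channels("8/0,1"))   # eight channel card, first two channels used
print(parse_channels("2"))       # plain stereo
```

Note that the order of the listed indexes is preserved, which matters when channels are mapped.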
The length of the list of names (given after the structure type name) must match or exceed the number of used channels. If there are more channels in the head (the logical, or virtual channels) than there are available through the device, the specified channels must be mapped onto the physical device channels. This is done with the mapping field, which simply is a list of indexes, which index in the head to map to which physical device channel. Here is a simplified example:

    output 14,15,16 {
        ...
        channels: 8/5,4;
        mapping: 0,1,0;
    };

In this example, two channels from the eight channel device are used, channels with index 5 and 4. The order of the channel indexes matters: physical channel 5 will now be considered the first (index 0) of the available physical channels, and 4 the second (index 1). The mapping field tells how to map the channels called 14, 15 and 16 in the header to those two physical channels. The mapping is in the same order as the channels in the header, that is 14 is mapped to physical channel index 0 (which is channel 5 on the eight channel device), 15 to index 1 (channel 4 on the device), and 16 to index 0, that is the logical channels 14 and 16 will mix into the same output on the device. In the standard case, where the number of logical channels is the same as the number of channels made available through the channels field, a mapping specification is not needed. Then the first logical channel is mapped to the first listed device channel and so on.
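The mapping in the example above can be traced step by step with a short Python sketch. The function name is made up for the illustration; the data is taken directly from the example.

```python
def route_logical_channels(names, used_device_channels, mapping=None):
    """Return {logical channel name: physical device channel}."""
    if mapping is None:
        # Standard case: first logical channel to first listed device channel.
        mapping = list(range(len(names)))
    return {name: used_device_channels[m] for name, m in zip(names, mapping)}


# output 14,15,16 { channels: 8/5,4; mapping: 0,1,0; };
routes = route_logical_channels([14, 15, 16], [5, 4], [0, 1, 0])
print(routes)
```

Logical channels 14 and 16 both end up on device channel 5, so they are mixed there, while 15 goes to device channel 4.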
The list of delays specifies how many samples a channel should be delayed. This could be used to compensate for speaker positions that are either too close or too far away. It could also be used to compensate for acausal filters. Delay can be changed in runtime, if maxdelay is not set to a negative value. It defines the upper bound of delay in samples. When the program is started, delay buffers matching maxdelay are allocated for all channels. If it is negative, only the precise amount specified by the delay array is allocated.
The setting individual_maxdelay was added later, and works the same as maxdelay with the difference that it is specified per channel. It is useful to save memory when there are many channels, and only some of them need dynamic delay (or considerably larger buffer than the others).
If the general setting sdf_length is larger than zero, the subdelay setting will take effect. It specifies the sub-sample delay per channel in 1/100th of samples (valid range is -99 to 99). This delay can be changed in runtime. To disable sub-sample delay on a channel, set its sub-delay to a negative value outside the valid range. Since sub-sample delay consumes CPU time, it is recommended to only activate it where necessary. Sub-delay filters add pre-response, and therefore all channels with sub-delay disabled will be automatically compensated with an I/O delay to make them aligned.
The mute list of booleans specifies, in order, which channels should be muted from the beginning. The muted channels can later be unmuted from the CLI.
If the dither flag is set to true, dither is applied on all used channels. Dither is a method to add carefully devised noise to improve the resolution. Although most modern recordings contain dither, they need to be re-dithered after they have been filtered for best resolution. Dither should be applied when the resolution is reduced, for example from 24 bits on the input to 16 bits on the output. However, one can claim that dither should always be applied, since the internal resolution is always higher than the output. When BruteFIR is compiled with single precision, it is not possible to apply dither to 24 bit output, since the internal resolution is not high enough. BruteFIR's dither algorithm is the highly efficient HP TPDF dither algorithm (High Pass Triangular Probability Distribution Function).
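To show what high pass TPDF dither means in practice, here is a Python sketch of one common way to generate it: each dither value is the difference between the current and the previous uniform random number, which gives a triangular distribution and pushes the noise towards high frequencies. This is a general textbook construction and not taken from BruteFIR's source, which uses a precalculated table. The function names are made up.

```python
import random


def hp_tpdf_dither(count, rng=None):
    """Yield `count` dither values in LSB units, each within [-1, 1]."""
    rng = rng or random.Random()
    previous = rng.random() - 0.5
    for _ in range(count):
        current = rng.random() - 0.5
        # Difference of two uniform values: triangular PDF, high pass spectrum.
        yield current - previous
        previous = current


def quantize_16bit(sample, dither_lsb):
    """Quantize a float in -1.0 to +1.0 to a signed 16 bit integer."""
    value = round(sample * 32768.0 + dither_lsb)
    return max(-32768, min(32767, value))


noise = hp_tpdf_dither(4, random.Random(1))
print([quantize_16bit(0.25, d) for d in noise])
```

Only one new random number is needed per sample since the previous one is reused, which is part of why this kind of dither is cheap.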
If the merge flag is set to true, discontinuities that may occur when coefficients are changed in runtime are smoothed out with a simple merge algorithm. This avoids "clicks" that may occur in the sound when coefficients are changed. Note that discontinuities occur also when volume is changed, but those are not merged, since they are generally not audible or are masked by the volume change itself. If someone does not agree with that, let me know, and I will make it apply the merger at volume changes too.

    filter <STRING: name | NUMBER: index> {
        from_inputs: <STRING: name | NUMBER: index>[/<NUMBER:attenuation in dB>][/<NUMBER:multiplier>][, ...];
        from_filters: <same syntax as from_inputs field>;
        to_outputs: <same syntax as from_inputs field>;
        to_filters: <STRING: name | NUMBER: index>[, ...];
        process: <NUMBER: process index>;
        coeff: <STRING: name | NUMBER: index>;
        delay: <NUMBER: pre-delay in blocks>;
        crossfade: <BOOLEAN: cross-fade when coefficient is changed>;
    };

Only the process field should be given in the default configuration file.
The filter structure defines where a filter is placed and what its parameters are. This is done in a filter:
If an output channel exists in several filter structures, the filter outputs will be mixed into that channel. Thus, a set of filter structures defines how inputs and outputs should be copied, mixed and filtered.
With help of the from_filters and to_filters fields, filters can be connected to each other. The only real constraint is that there must be no loops. BruteFIR will detect and point out errors if such exist in a given filter network. Note that if possible coefficients should be pre-convolved rather than put as filters in series, since a 2N length filter computes much faster than two cascaded N length filters.
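Pre-convolving two coefficient sets is just a convolution of the two lists of coefficients. The Python sketch below uses the direct method, which is fine for a one-time offline calculation; for very long filters you would probably want an FFT-based tool instead. The function name is made up.

```python
def convolve(a, b):
    """Direct convolution of two coefficient lists."""
    result = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            result[i + j] += x * y
    return result


first = [1.0, 2.0]
second = [1.0, 3.0]
print(convolve(first, second))   # one filter replacing the two in series
```

Note that two N length filters give a combined filter of 2N - 1 coefficients, so it fits within a 2N length filter.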
The from_inputs, from_filters and to_outputs fields have the same syntax. One channel/filter is given as the string name or index number, and if attenuation should be applied, it is followed by a slash (/) and attenuation in dB. Instead of, or combined with, attenuation in dB, a multiplier can be given, a number which all samples will be multiplied with. The notation "channel 1"/6/-1 means that channel 1 is attenuated 6 dB and the polarity is changed (multiplication with -1). It is also possible to write "channel 1"//-0.5 which is practically equivalent to the first example, since 6 dB of attenuation corresponds to a multiplier of about 0.5. If more than one channel should be included, they are separated with commas. The to_filters field has the same syntax with the exception that attenuation is not allowed.
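How attenuation and multiplier combine can be written out in Python. The sketch assumes the usual amplitude convention of 20 dB per factor of ten; the function name is made up.

```python
def gain(attenuation_db=0.0, multiplier=1.0):
    """Linear gain for a channel reference such as "channel 1"/6/-1."""
    return 10.0 ** (-attenuation_db / 20.0) * multiplier


print(gain(6.0, -1.0))    # "channel 1"/6/-1
print(gain(0.0, -0.5))    # "channel 1"//-0.5
```

The two results differ only in the third decimal, since 6 dB of attenuation is a factor of about 0.501 rather than exactly 0.5.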
The process field specifies in which Unix process the filter should be run. All filters with the same process index will run in the same process. Process index 0 must exist, and if there are more processes they should be in series, 0, 1, 2, 3 and so on. This field is important if BruteFIR runs on a multi-processor machine. The optimal situation is that there is one process per processor, and that each process requires the same processor time. Then you will get most out of your multi-processor computer. There is one limitation of how filters can be distributed between processes: mixing to an output channel or a filter input must be done within the same process.
If the process field is set to -1, an automatic but naive load balancing will take place, which may or may not be as good as a hand-made load balancing.
The coeff field defines which coefficient set should be used for the filter. It could be given as the string name of the set, or as its index number. If the index number is set to minus one (-1), there will be no filtering in the filter, it will just mix and copy inputs/outputs as specified. Note that the length of the coefficient set determines how processor intensive the filter will be.
The delay field specifies how many filter blocks pre-delay there should be. Zero or negative means no delay. The maximum allowed delay is one block less than full length. Thus, with unpartitioned filtering there can be no delay at all. The delay cost is zero both in terms of memory and processing.
If the crossfade setting is set to true, there will be a
cross-fade when the coefficient is changed in runtime, making the
coefficient change seamless. When changing coefficient (using the CLI
for example), the filter will convolve one block with the old
coefficient, fade that out, and mix it with a faded-in block convolved
with the new coefficient. At the time of a coefficient change there
will thus be roughly twice the amount of processing for that filter.
This processing spike can cause buffer underflow if running with a
sound card and the CPU load is already heavy in the normal case. For
example, if there are 10 filters in a configuration (all with
crossfade active) and all coefficients are changed at the same time,
the normal CPU load should not exceed 50%, since the spike will
require roughly twice the load. However, if the coefficients are
changed only one filter at a time, only 10% extra processing is
required compared to the normal case in the example.
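The arithmetic in the example above can be sketched in a few lines of Python. This is only an estimate under the assumptions stated in the text (every filter costs the same, and a crossfading filter costs roughly twice as much for one block); the function name peak_load is hypothetical.

```python
def peak_load(normal_load, n_filters, n_changed):
    """Estimated CPU load while n_changed of n_filters are crossfading.

    Assumes all filters cost the same and that a crossfading filter needs
    roughly twice its normal processing during the change.
    """
    per_filter = normal_load / n_filters
    return normal_load + per_filter * n_changed


# 10 filters at 50% normal load, all changed at once: the CPU is saturated.
print(peak_load(0.5, 10, 10))
# Same setup, one filter changed at a time: 10% more than the normal load.
print(peak_load(0.5, 10, 1) / 0.5)
```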
Here follows an example of a main configuration file, showing some of BruteFIR's possibilities. It implements a cross talk cancellation filter for a stereo dipole. The filters are split across two processes to get the most out of a dual-processor machine. A computer with a single processor should if possible keep all filters within the same process for best performance. Note that the configuration uses the default settings extensively. For example, no general settings have been specified apart from the addition of the CLI logic module, and in the coeff structures, only the filename field is used.
logic: "cli" { port: 3000; };

coeff "direct path" {
        filename: "direct_path.txt";
};

coeff "cross path" {
        filename: "cross_path.txt";
};

input "left", "right" {
        device: "file" { path: "/disk0/tmp/music.raw"; };
        sample: "S16_LE";
        channels: 2;
};

output "stereo dipole left", "stereo dipole right" {
        device: "file" { path: "output01.raw"; };
        sample: "S16_LE";
        channels: 2;
};

filter "left speaker direct path" {
        inputs: 0/6.0;
        outputs: 0;
        process: 0;
        coeff: "direct path";
};

filter "left speaker cross path" {
        inputs: "right"/6.0;
        outputs: "stereo dipole left";
        process: 0;
        coeff: "cross path";
};

filter "right speaker direct path" {
        inputs: "right"/6.0;
        outputs: "stereo dipole right";
        process: 1;
        coeff: "direct path";
};

filter "right speaker cross path" {
        inputs: "left"/6.0;
        outputs: "stereo dipole right";
        process: 1;
        coeff: 1;
};
I/O modules are used to provide sample input and output for the BruteFIR convolution engine. It is entirely up to the I/O module how to produce input samples or store output samples. It could for example read input from a sound card, a file, or simply generate noise from a formula.
In the BruteFIR configuration file, an I/O module is specified in each input and output structure.
The purpose of having I/O modules instead of building all functionality directly into BruteFIR is that it should be easy to extend with new functionality, without compromising the core convolution engine.
All I/O modules have the extension ".bfio".
The ALSA I/O module (named "alsa") is used to read and write samples
from/to sound cards. It supports all BruteFIR sample formats that are
also supported by the referenced sound device. The basic configuration
is simple: only one field, called device, needs to be set, where
the associated value is a string which is passed without modification
to ALSA's device open function. Examples: "alsa" { device: "hw"; }
or "alsa" { device: "hw:1"; }.
In the above examples, the hardware is accessed directly (the "hw" prefix), but you can also use ALSA's software modes. That is however not recommended, since some functions of BruteFIR, for example overflow protection, expect to be at the very last output stage, and not before another software layer which may perform for example mixing or volume control.
In theory it should also be possible to access files (for example
wav-files) through ALSA, "alsa" { device: "file:test.wav"; },
but this does not seem to work currently, and is not recommended,
since the module assumes that all devices are driven by a sample clock
(that is, that they are sound cards).
If the ALSA I/O module is used in several input/output structures, all
referenced sound cards will be linked together using the ALSA
API. This makes starting and stopping sound cards synchronized if the
hardware and driver support it; if not, the ALSA subsystem tries to
make starting and stopping as synchronized as it can. However, when
many ALSA devices are used, this linking can cause the computer
to lock up; at least it has happened in the past. This is probably due
to a problem in ALSA, and may have been resolved when you read
this. However, should you bump into problems, you can disable linking
by setting link to false (example: "alsa" { device: "hw:1"; link: false; }).
Per default, when reading fails due to an overflow, or writing fails
due to an underflow, BruteFIR will abort. If your computer is heavily
loaded, and/or partitions are short, and/or other services are running
on the computer, over/underflow can occur occasionally. In those
cases, one might prefer occasional clicks in the sound to
a total stop. The ALSA I/O module can hide over/underflow from
BruteFIR, so that it will not abort when that occurs. Just set the
ignore_xrun parameter to true (example: "alsa" { device: "hw:1"; ignore_xrun: true; }).
The OSS I/O module (named "oss") provides sound card I/O through the
OSS API. It has only one parameter, device, which points out
the device to open. Example: "oss" { device: "/dev/dsp"; }.
The I/O module supports OSS multi-channel and full duplex modes.
The JACK I/O module (named "jack") provides BruteFIR with support for the low-latency JACK audio server [23]. JACK is an audio server under development, and the goal for the JACK I/O module is that it should be compatible with the current CVS version.
To avoid putting I/O-delay into the JACK graph, the JACK buffer size should be set to the same as the BruteFIR partition size. It is however possible to set the JACK buffer size to a smaller value. The I/O-delay in number of JACK buffers as seen by following JACK clients will be:
2 * <BruteFIR partition size> / <JACK buffer size> - 2
Note that both the JACK buffer size and the BruteFIR period size are always powers of two.
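The formula above is easy to evaluate for a few buffer sizes. The following Python sketch does so; the function names are hypothetical, and rejecting a JACK buffer larger than the partition is an assumption made here, since the text only discusses equal or smaller JACK buffers.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def jack_io_delay(partition_size, jack_buffer_size):
    """I/O delay in JACK buffers: 2 * partition / buffer - 2."""
    if not (is_power_of_two(partition_size) and is_power_of_two(jack_buffer_size)):
        raise ValueError("both sizes are expected to be powers of two")
    if jack_buffer_size > partition_size:
        # Assumption of this sketch: this case is not covered by the text.
        raise ValueError("JACK buffer size larger than the partition size")
    return 2 * partition_size // jack_buffer_size - 2


print(jack_io_delay(1024, 1024))  # equal sizes: 0 buffers of added delay
print(jack_io_delay(1024, 256))   # smaller JACK buffer: 6 buffers
```

With equal sizes the result is zero, which matches the recommendation to set the JACK buffer size to the BruteFIR partition size.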
Currently, the JACK I/O module assumes that jackd is run with the -R parameter, at its default client realtime priority which is 9.
The names of the BruteFIR ports will be "brutefir:input-X" for the
inputs, and "brutefir:output-X" for the outputs, where X is the channel
index. The JACK client name, which is per default "brutefir", can be
changed by setting "clientname" (example: clientname: "brutefir-A";). It
is a global setting, and if used it must be set in the first JACK
device clause (the first from the top in the configuration file). The
clientname will change the port name prefix as well (the prefix is the
client name). If multiple BruteFIR instances should be run, they must
have different client names, or else the port names will collide.
If the local ports should be connected to other JACK ports at startup,
the setting ports is used, where the associated string values
are the names of the ports to connect to. Examples:
"jack" { ports: "alsa_pcm:capture_1", "alsa_pcm:capture_2"; }
for input, and "jack" { ports: "alsa_pcm:playback_1", "alsa_pcm:playback_2"; }
for output. The number of ports listed must equal the number of
channels for the device. However, empty strings can be used if a
specific channel index should not be connected, for example:
"jack" { ports: "", "alsa_pcm:capture_2"; }
will only connect the second port.
It is also possible to give the local ports names other than
the default, like this: "jack" { ports: "alsa_pcm:capture_1"/"in-A"; },
that is, by adding a slash and specifying a name after it. This name
will replace the default "input-X" for inputs and "output-X" for
outputs. If a port should not be connected but still be named, the
first string is left empty, like this: "jack" { ports: ""/"in-A"; }.
If no ports should be connected, and the client name is left to the
default, the JACK device clause is empty ("jack" { };).
The sample format for the JACK device should be set to AUTO,
which will be the JACK sample format (floating point).
The raw PCM file I/O module (named "file") is used to read and write
samples from/to files. It supports all BruteFIR sample formats and
reads/writes them directly in raw form, interleaved format. In the
simplest case only the filename is given. Example:
"file" { path: "test.pcm"; }. One can also specify how many
bytes to skip at the beginning of input files, and whether to append
to output files. Examples: "file" { path: "test.pcm"; skip: 44; }
and "file" { path: "test.pcm"; append: true; }.
It is also possible to read from and write to text files (X floating
point ASCII values per line separated with whitespace, where X is the
number of channels). Just add the option text: true;. The
module will convert to/from 64 bit floating point, and thus requires
that sample format (or use AUTO).
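As an illustration of the text layout described above, here is a minimal Python sketch that writes such a file: one line per sample frame, one whitespace-separated value per channel. The helper name write_text_samples and the file name are hypothetical, and the fixed-point number formatting is an assumption; the exact set of number formats the module accepts is not specified here.

```python
import os
import tempfile


def write_text_samples(path, frames):
    """Write one line per sample frame, one value per channel."""
    with open(path, "w") as f:
        for frame in frames:
            # Plain decimal notation is assumed to be accepted by the module.
            f.write(" ".join(f"{float(v):.9f}" for v in frame) + "\n")


# Two channels, three frames.
frames = [(0.0, 0.5), (0.25, -0.25), (1.0, -1.0)]
path = os.path.join(tempfile.mkdtemp(), "test.txt")
write_text_samples(path, frames)
```

A file written this way would then be referenced with the path and text: true; options of the file I/O module in a two-channel input structure.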
If the file I/O module is used for input, the input file can be
looped, by setting loop to true.
By using /dev/stdin like this "file" { path: "/dev/stdin"; },
BruteFIR will read data from standard input, so
it is then possible to do things like
mpg123 -s test.mp3 | brutefir.
This will probably never be documented. The best way is to look at the source code to see how it is done.
The CLI logic module (named "cli") provides a command line interface available through telnet, a local socket, a pipe, or a serial line. The CLI is used for changing settings in runtime, which is of course only suitable when BruteFIR is used in realtime. It can be used interactively by hand, for example by connecting to it through telnet. It is also suitable for scripting BruteFIR, or using it as a means of inter-process communication if BruteFIR is used as the convolution engine for another program.
The context-sensitive port field specifies which interface will be used, as follows:
port: <INTEGER: TCP port number>; the CLI will listen on the given port number for incoming telnet clients.
port: <STRING: "/dev/" ...>; when the string starts with "/dev/" the CLI assumes that a serial device (such as "/dev/ttyS0" on Linux) is pointed out, and opens it as a serial port with the default line speed of 9600 baud, unless the line_speed field is used to specify another speed.
port: <STRING: name of local socket>; any other string not starting with "/dev/" is handled as the file name for a local socket, and the CLI will create and listen for incoming connections on the given path. If the path exists, it will be replaced.
port: <INTEGER: read end file descriptor>, <INTEGER: write end file descriptor>; the CLI will assume that the given file descriptors are already opened and ready for use, and will attach the read end to CLI input, and the write end to CLI output. This interface is suitable as inter-process communication when BruteFIR is integrated into another program, and is started through fork() and exec().
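The selection rules for the port field can be summarized as a small decision function. The Python sketch below only mirrors the rules as described above; it is not BruteFIR's parser, and the function name classify_port and the returned labels are hypothetical.

```python
def classify_port(port):
    """Return which CLI interface a given port value would select.

    Illustration of the rules in the text only. A pair of file
    descriptors is represented here as a 2-tuple of integers.
    """
    if isinstance(port, int):
        return "tcp"                # listen for telnet clients
    if isinstance(port, str):
        if port.startswith("/dev/"):
            return "serial"         # serial device, 9600 baud by default
        return "local socket"       # any other string is a socket path
    if (isinstance(port, tuple) and len(port) == 2
            and all(isinstance(fd, int) for fd in port)):
        return "pipe"               # (read end, write end) descriptors
    raise ValueError("unsupported port value: %r" % (port,))


print(classify_port(3000))
print(classify_port("/dev/ttyS0"))
print(classify_port("/tmp/brutefir.sock"))
print(classify_port((3, 4)))
```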
The CLI does not have much terminal functionality to speak of, and is thus a bit cumbersome to use interactively. It reads a whole line at a time, and can interpret backspace, but that is about it. There is no echo functionality so the connecting client needs to handle that (telnet does, and terminal software for serial lines usually have a function to enable local echo).
Instead of specifying a port, one can specify a string of commands,
which will be run in a loop as a script. Example:
"cli" { script: "cfc 0 0;; sleep 10;; cfc 0 1;; sleep 10"; }.
The script may span several lines. Each line is carried out atomically
(this is also true for command line mode), so if there are several
commands on a single line, separated with semicolons, they will be
performed atomically (an atomic set of statements). The exception is
when an empty statement (just a semicolon) is put in the line, as in
the script example. This works as a line break, and thus separates
atomic sets of statements.
A typical use for atomic set of statements is to change filter coefficients and volume at the same time.
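To make the grouping rule concrete, the following Python sketch splits a single script line into atomic sets of statements, treating an empty statement as a separator. It is a simplified illustration and not the CLI's actual parser: the function name atomic_sets is hypothetical, and semicolons inside quoted strings are not handled.

```python
def atomic_sets(script_line):
    """Split one script line into atomic sets of statements.

    Statements are separated by semicolons; an empty statement closes the
    current atomic set. Naive split: quoted semicolons are not supported.
    """
    sets, current = [], []
    for statement in script_line.split(";"):
        statement = statement.strip()
        if statement:
            current.append(statement)
        elif current:
            sets.append(current)
            current = []
    if current:
        sets.append(current)
    return sets


# The script example from the text: four separate atomic sets.
print(atomic_sets("cfc 0 0;; sleep 10;; cfc 0 1;; sleep 10"))
# Coefficient and volume changed together: one atomic set of two statements.
print(atomic_sets("cfc 0 1; cfoa 0 0 6"))
```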
The sleep function in the CLI allows for sleeping in seconds,
milliseconds or blocks. One block is exactly the filter length in
samples, and if partitioned, it is the length of the partition. Block
sleep can only be used in script mode.
When in script mode, the first atomic statements will be executed just before the first block is processed, then the block is processed (and sent to the output), and then the next set of atomic statements is run. That is, each set of atomic statements is performed before the corresponding block is processed. The next atomic statement set is not performed until the next block is about to be processed.
The block sleep command (only works in script mode) works such that
the sleep is commenced at the next block. The statement sleep b1;
will thus cause the next block to be skipped. Note that since one block
passes for each atomic statement set, a single line with only
sleep b1; will skip two blocks, not one, since one block is consumed
when parsing the sleep command, and the other is skipped by the sleep
duration. That is, to skip only one block, either use sleep b0; alone,
or use sleep b1 as the last statement together with other
statements in an atomic statement set (recommended).
Sleep in seconds and milliseconds will start the timer when the command is issued (at the start of the block if in a script), and continue with the next command after at least the given time has passed. If run in a script, the timer is polled at the start of each block, and the next command is then executed at the start of the first block where the timer has expired.
If several sleep commands are executed in the same atomic statement set in a script, only the last will take effect, and will be executed only when all other commands in the set have been processed. To avoid confusion, it is thus recommended to employ sleep commands either alone, or as the last in the atomic statement set.
If the field echo is set to true, the CLI commands will be
echoed back to the user (the whole line at a time). This is off per
default.
When connected and you type "help" at the prompt, you will get the following output:
Commands:
lf -- list filters.
lc -- list coefficient sets.
li -- list inputs.
lo -- list outputs.
lm -- list modules.
cfoa -- change filter output attenuation.
        cfoa <filter> <output> <attenuation|Mmultiplier>
cfia -- change filter input attenuation.
        cfia <filter> <input> <attenuation|Mmultiplier>
cffa -- change filter filter-input attenuation.
        cffa <filter> <filter-input> <attenuation|Mmultiplier>
cfc -- change filter coefficients.
        cfc <filter> <coeff>
cfd -- change filter delay. (may truncate coeffs!)
        cfd <filter> <delay blocks>
cod -- change output delay.
        cod <output> <delay> [<subdelay>]
cid -- change input delay.
        cid <input> <delay> [<subdelay>]
tmo -- toggle mute output.
        tmo <output>
tmi -- toggle mute input.
        tmi <input>
imc -- issue input module command.
        imc <index> <command>
omc -- issue output module command.
        omc <index> <command>
lmc -- issue logic module command.
        lmc <module> <command>
sleep -- sleep for the given number of seconds [and ms], or blocks.
        sleep 10 (sleep 10 seconds).
        sleep b10 (sleep 10 blocks).
        sleep 0 300 (sleep 300 milliseconds).
abort -- terminate immediately.
tp -- toggle prompt.
ppk -- print peak info, channels/samples/max dB.
rpk -- reset peak meters.
upk -- toggle print peak info on changes.
rti -- print current realtime index.
quit -- close connection.
help -- print this text.
Notes:
- When entering several commands on a single line,
  separate them with semicolons (;).
- Inputs/outputs/filters can be given as index
  numbers or as strings between quotes ("").
Most commands are simple and need no further explanation. Naturally, any change takes effect only after the I/O delay has passed. The exceptions are the mute and change delay commands, which lag only by the period size of the sound card, which is most often smaller than the program's total I/O delay. However, when there is a virtual channel mapping, mute and delay changes will lag by the full I/O delay as well.
The imc, omc and lmc commands are used to
give commands to I/O modules and logic modules in run-time. To find
out which modules are loaded and which indexes they have, use the
command lm. Not all modules support run-time commands though.
Changing attenuations with cffa, cfia and cfoa
can be done with dB numbers or simply by giving a
multiplier, which then is prefixed with m, like this:
cfoa 0 0 m-0.5. Changing the attenuation with dB will not change the sign
of the current multiplier.
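The sign-preserving behaviour described above can be pictured as follows. This Python sketch is only a model of the described outcome, not BruteFIR's implementation; the function name set_attenuation_db is hypothetical and the amplitude dB convention (20 * log10) is assumed.

```python
import math


def set_attenuation_db(current_multiplier, attenuation_db):
    """New multiplier after a dB attenuation change.

    The magnitude comes from the dB value; the sign (polarity) is taken
    from the multiplier that was in effect before the change.
    """
    magnitude = 10.0 ** (-attenuation_db / 20.0)
    return math.copysign(magnitude, current_multiplier)


# An inverted input stays inverted after a change given in dB.
print(round(set_attenuation_db(-0.5, 6.0), 3))
# A non-inverted input stays non-inverted.
print(round(set_attenuation_db(1.0, 6.0), 3))
```

To change the polarity itself, the multiplier form of the command has to be used instead of a dB value.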
The equalizer logic module takes control over one or more coefficient sets, and renders equalizer filters to them, as specified by the user. This can be done in the initial configuration, and also updated in runtime, through the CLI.
The startup configuration can look like this:
"eq" {
        debug_dump_filter: "/tmp/rendered-%d";
        {
                coeff: 0, 1;
                #bands: "ISO octave";
                #bands: "ISO 1/3 octave";
                bands: 100, 200, 500;
                magnitude: 20/-3.2, 100/8.5;
                phase: 20/0, 100/180;
        };
        {
                coeff: "eq-1";
                bands: "ISO octave";
                magnitude: 31.5/-3.2, 125/8.5;
                phase: 31.5/3.2;
        };
};
If you want to analyze the rendered filters, the
debug_dump_filter setting specifies a file name where the
rendered coefficients will be written. It must contain %d, which will
be replaced by the coefficient index. Then follow the equalizers. Each
specifies which coefficient index (or name) it should render the
equalizer filter to. These coefficients must be allocated and must be
stored in shared memory, for example like this:
coeff 0 { filename: "dirac pulse"; shared_mem: true; blocks: 4; };
The dirac pulse will be replaced by the rendered filter. Each equalizer has a set of frequency bands (max 128), they can be manually specified, or use the ISO octave band presets. Optionally, magnitude (in dB) and phase (in degrees) settings can be specified. The frequency value must then match one of the given bands.
If you specify two coefficient sets, the rendering will be double-buffered, meaning that the eq module will keep one coefficient active in the filter(s), render to the other, and switch when ready. This means that there is no risk of playing an incomplete equalizer, which can cause some noise (usually in the form of a beep). It is thus recommended to use double-buffered mode if the equalizer will be altered in runtime. In the filter configuration and when referring to the equalizer in the CLI, the first of the two coefficients should then be used.
In run-time, equalizers can be modified through the CLI. An example:
lmc eq 0 mag 20/-10, 4000/10
will set the magnitude to -10 dB at 20 Hz and +10 dB at 4000 Hz for
the equalizer for coefficient 0. Instead of mag, phase can be given.
The command lmc eq "eq-1" info
will list the current settings for the equalizer
stored in the coefficient called "eq-1".
The more heavily loaded the computer is by convolution, the longer time it will take to render the new equalizer. If the coefficient set it renders to is very short, and the magnitude and phase response is very detailed (sharp edges etc) it will not be able to adapt to it fully.
This will probably never be documented. Just look at the source code and see how it is done.
The program calculates a realtime index which can be shown through the CLI, or will be printed periodically to the screen if the show_progress flag is set. The realtime index is a floating point value. When it is 1.0, 100% of the available processing power must be used at all times to be able to achieve realtime performance. If it is larger than 1.0, it means that with the current configuration, BruteFIR will not manage realtime performance.
If your configuration is too demanding for realtime, you should shorten the filters (or remove channels) until the realtime index is just below 1.0, perhaps 0.95. This way you make full use of your computer. However, if you have multiple processors, it is not as simple. The realtime index will show how much is needed from the most loaded processor, but leaves proper load balancing to you. So, devise your configuration carefully if you have multiple processors. The number of input and output channels and the filter length are what consume processor time. The number of filters, dither, delay, mixing and attenuation are very cheap in comparison.
When testing with realtime indexes above 1.0, inputs and outputs must of course be files. For performance testing, you could use "/dev/zero" for input and "/dev/null" for output. Also note that it takes some time for the index to stabilize.
The realtime index typically matches the processor load, if running with a sound card. However, if input poll mode is employed, the realtime index can be considerably lower than the processor load, since input polling is performed in the spare processor time.
When BruteFIR runs for the first time, it will generate FFTW wisdom, which takes some time. FFTW wisdom is benchmarking information which tells the FFTW library how to run FFT in the most efficient way on the given computer. Since the information is hardware and binary dependent, the file should be removed when hardware is changed/upgraded or BruteFIR is recompiled. A wisdom file that was not generated on the hardware BruteFIR is running on, or not by the binary that is run, may yield sub-optimal performance. When BruteFIR is calculating FFTW wisdom, the computer should not be running other processor-demanding software.
Naturally, it is very important that FFTW was compiled with the correct optimization flags to achieve optimal performance.
The wisdom is loaded, used and updated each time BruteFIR is run. Each time BruteFIR uses a partition length it has not used before (and thus there is no wisdom available), it will need to generate new wisdom, which will take some time.
If you are going to use BruteFIR in realtime, it is strongly recommended that you patch your kernel to reduce latency, or else the program may fail to keep up when a cron-job or a screen saver starts. The Linux kernel's latency problems have been reduced in the 2.4 kernel, but it is still not satisfactory without the patch applied.
For the 2.4 kernel, Andrew Morton's low latency patches are recommended [24].
The new 2.6 kernel does have a low-latency setting in the kernel configuration, which should be activated. Although no extra patches should be required for a 2.6 kernel in the normal case, there still are low-latency patches out there for really demanding situations.
If you use digital input and output, as I would recommend, you may get problems if the sound card is not configured properly. It is very important that the input and output sample clocks use the same clock as reference. Otherwise, micro-differences between the input and output sample clocks will make BruteFIR's I/O buffers slide apart, and eventually make the program stop. Usually there is an option to set the digital sound card's sample clock to 'slave'.
If you have analog input or output or both, you cannot get this problem (unless you use several different sound cards, then it will fail due to differences in clocking).
Digital sound cards that work in slave mode allow the sample clock to be changed in runtime. Usually, this is not what one wants for BruteFIR, since the filters are designed for only one sample rate. Therefore BruteFIR can be configured to exit if it detects a sample clock different from the one given in the configuration file.
BruteFIR can run with 32 or 64 bit floating point internal
resolution. Traditionally, 32 bit is called "single precision", and 64
bit "double precision". The float_bits setting is used to
change resolution. Per default, BruteFIR runs in 32 bit.
Depending on processor used, you may lose assembler optimizations when running in 64 bit. Also, memory bandwidth used by BruteFIR will naturally double, which reduces performance. Thus, although 64 bit and 32 bit operations are generally equally fast, due to increased memory usage, BruteFIR needs 30 - 50% extra processor time, not counting additional effects if assembler optimizations are lost.
When do you need double precision? If you are picky enough on sound quality that you would require dither on 24 bit output, then you need double precision. For most audio work however, 32 bit precision is enough.
There is no formula for calculating the optimal number of partitions to get maximum throughput. It varies between hardware platforms, so trial and error is the only working method. More than about 16 partitions are generally not recommended though.
If you are using partitioned filters to reduce the I/O-delay for realtime filtering, make sure that it does not get too low. If I/O-delay is too low, the sound card can get overflowed/underflowed causing the program to exit with a broken pipe signal.
Extremely low latencies, such as 64 sample partitions, will probably not work for long periods of time, even with a low latency patched kernel.
The processor cannot be loaded more than typically 85% for safe realtime operation. For very low latencies, this number could go down to 70%. The reason for this is that computing time will vary somewhat, that is how modern computers work, and to be able to cope with the maximum computing times, some spare processor time must be left.
Which new features get into BruteFIR is decided by its users. If you need a feature, let me know, and I'll see what I can do (and want to do).