
Voice-activated


Junior_jessy

Question

Hello,

I'm trying to build a voice-activated system on a Nexys 4 and I want to use the Pmod MIC3. To do so, I want to use the SPI protocol, but I don't know how.

So how can I choose the SPI protocol?

Furthermore, I have some issues filling in the constraints file. On the Digilent website, pin 2 is shown as not connected, but the SPI protocol suggests the opposite.

So how do I correctly connect the inputs/outputs of the peripheral to the board in the constraints file?

Thank you in advance for your answer. 

Junior 



@Junior_jessy,

I see your question, but while I've done signal, audio, and even voice processing before, I've never done speech recognition.  For most of my career that has been in one of those "too hard" categories.  This was one of the reasons why I suggested you get your algorithm working off-line first.  I'm sorry I can't help much more than that; it's not something I know how to do.  (I know more about how to implement a given algorithm within either embedded s/w or an FPGA than how to build a voice recognition algorithm.)  That's why I was silent.

Dan


Hi @D@n 

 

I'm trying to trigger a command when one of these words is spoken: in French, "gauche", "droit", "haut", "bas", or in English, "up", "down", "left", "right"; either language is fine.

I found a way to implement the voice activation in five steps:

1: remove silence from the signal, with an amplifier filter (silenceremoval.m)

2: remove the noise, with a Butterworth filter (silince_noise_remove.m; a small sketch of this step follows below)

3: smooth the signal using a simple Gaussian smoothing function (smooth.m, EndPointing.m)

4: calculate the signal's MFCCs (Mel-frequency cepstral coefficients) (calculate_mfcc.m)

5: calculate the DTW (dynamic time warping) between the reference signal and the input signal; it returns 0 when the two signals are equal and a positive real number otherwise (DTW.m)

To identify a word: SpeechTry.m, gauchero2.wav, basro2.wav, droitro2.wav, hautro2.wav

I have four reference signals, one for each word command. After the input signal has passed through steps 1 to 4, I calculate the DTW between the result and each reference signal. The smallest DTW identifies the word.
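For illustration, a minimal Octave version of step 2 could look like the lines below. This is not the exact code in silince_noise_remove.m; the sample rate, filter order and band edges are only placeholder values.

% Minimal sketch of step 2: noise removal with a Butterworth band-pass filter.
% The sample rate, order and band edges are assumed values, not the ones used
% in silince_noise_remove.m.
pkg load signal;                           % Octave package providing butter()/filtfilt()
fs = 16000;                                % assumed sample rate in Hz
x  = randn(fs, 1);                         % one second of placeholder "noisy" audio
[b, a]  = butter(4, [300 3400] / (fs/2));  % 4th-order band-pass over a typical speech band
x_clean = filtfilt(b, a, x);               % zero-phase filtering, no group delay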

I've got some issues: the algorithm easily identifies some words with nearly 90% success, such as "gauche", maybe because it is very different from the other words. But other words, such as "haut", are still hard to identify.

  

 

Thank you in advance for your responses.


I'm wary of Matlab magic converters.
Maybe they actually work and generate results that fit into an entry-level FPGA(!). If so, I'm the wrong guy to comment.

That said: if an FPGA is a must, I would try to renegotiate the project scope to use an MCU (a soft core, or better, the ARM on a Zynq board) with FPGA accelerators.

Then implement the prototype in plain C. Which you can test on a standard PC.

Then strip it down so it runs (very slowly) on the target platform's processor.

Then create FPGA accelerators for select functions and test them independent of the algorithm.
Maybe an FFT is already enough (the MCU should be able to handle the log() with a fast approximation function).

Now if I had to make a project plan for a newcomer working independently, I'd say a month, or two to be on the safe side...in reality probably three.


@Junior_jessy,

Ok, I take that back then ... it sounds from your description like you might be ready to move forward.  I've heard that Cepstral processing works quite nicely for speech, although I've never tried it myself.

So, your algorithm will have several parts ... I like how you've shown it above to work in several sections.  Now imagine that each of those sections will be a module in your design.  (An entity for VHDL types.)  That module should take no more than one sample input per clock, and produce one sample output per clock.  Your goal will be to do all the processing necessary one sample at a time.

This applies to the Cepstrum as well.  As I recall, a Cepstrum is created by doing an FFT, taking the log of the result (somehow), and then taking an IFFT.  FFTs in FPGAs tend to be point-by-point processes: you put one sample in at a time and get one sample out.  So expect to do a lot of stream processing.  Alternatively, for audio frequencies, it might make sense to do some block processing ... but that's something you'll need to decide.  Either way, it's likely to look to you like you are processing one sample at a time.
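(For a frame-based reference point, a real cepstrum can be sketched in a couple of Octave lines; this is only the math, not the streaming FPGA structure I'm describing.)

% Sketch of a real cepstrum for one 512-sample frame: FFT, log magnitude, inverse FFT.
frame = randn(512, 1);                        % placeholder frame of audio
c = real(ifft(log(abs(fft(frame)) + eps)));   % eps guards against log(0)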

You mentioned above that all you knew how to do were minimal counters, shift registers and such.  Relax: you are in good hands.  Most of the designs I just mentioned, including the FFT, are built primarily out of shift registers, counters and other simple logic.  However, there are two other components you will need to work this task: you'll want to know how to do a (DSP-enabled) multiply, and how to use the block RAM within your FFT.  The rule for both of these is that they should each get a process all to themselves.  One process to read from RAM, one process to write to RAM, and one process to do your multiply--don't do any other operations in any of those three types of processes.  That's the tableau you have to work with.  How you put it together--that's up to you and your skill as an engineer.

As for the FFT, you are welcome to use mine or you can use Xilinx's.  That'll at least keep you from rebuilding that wheel.

You might find it valuable to rearrange the Octave script you've highlighted above so that it works on one sample at a time as I'm describing.  Think about it.
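(A rough Octave sketch of what "one sample at a time" means is below. The smoothing stage here is a placeholder I made up, not your smooth.m; the point is only the structure: each stage takes one sample plus its own state, and returns one sample plus the updated state, just like an HDL module would on each clock.)

% Streaming sketch: one sample in, one sample out per loop iteration.
% smooth_stage() stands in for any one of your processing stages.
fs = 16000;
x  = randn(fs, 1);                % placeholder input stream
st = struct('acc', 0);            % the stage's internal state (one register)
y  = zeros(size(x));
for n = 1:length(x)               % each iteration plays the role of one clock
    [y(n), st] = smooth_stage(x(n), st);
end

function [y, state] = smooth_stage(x, state)
    a = 0.125;                            % smoothing coefficient (assumed)
    y = state.acc + a * (x - state.acc);  % one-pole smoother
    state.acc = y;                        % updated state carried to the next sample
end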

Hopefully, though, that gets you going ... a bit.

Keep me posted,

Dan


There is only one known pattern: signal. Maybe you need to know what the input signal and template are.

I build signal by calculating the MFCCs of a reference signal. It's a recording of only one word, such as "up".

template is the MFCCs calculated from an ad-hoc recorded signal. DTW is not my main function.

I haven't considered that the amplitude can change, because the recordings will be made under the same conditions.

But you're right, I can manage real speech. I record only when I say a word. Then I remove the noise and silence, calculate the MFCCs of the recording, and compare the result with the reference using DTW.

 

function [mm, t, Matched, TEST_SMOOTHED] = SpeechTry(TEST)  % TEST is the ad-hoc recording (2 seconds)

% Reference signals, one per command word
UP    = wavread('avancero1.wav');    %1
DOWN  = wavread('reculero1.wav');    %2
LEFT  = wavread('montero1.wav');     %4
RIGHT = wavread('descendrero1.wav'); %3

% Remove silence and noise from the signals
TEST_SMOOTHED = EndPointing(TEST);
UP    = EndPointing(UP);
DOWN  = EndPointing(DOWN);
RIGHT = EndPointing(RIGHT);
LEFT  = EndPointing(LEFT);

% Calculate the MFCCs
FeatureVectorsT     = calculate_mfcc(TEST_SMOOTHED);
FeatureVectorsUP    = calculate_mfcc(UP);
FeatureVectorsDOWN  = calculate_mfcc(DOWN);
FeatureVectorsRIGHT = calculate_mfcc(RIGHT);
FeatureVectorsLEFT  = calculate_mfcc(LEFT);

% Calculate the DTW distance (a real number) against each reference
DUP    = DTW(FeatureVectorsT, FeatureVectorsUP);
DDOWN  = DTW(FeatureVectorsT, FeatureVectorsDOWN);
DRIGHT = DTW(FeatureVectorsT, FeatureVectorsRIGHT);
DLEFT  = DTW(FeatureVectorsT, FeatureVectorsLEFT);

% The smallest DTW distance indicates the word closest to the recording
M = [DUP, DDOWN, DRIGHT, DLEFT];
[mm, I] = min(M);
if (I == 1)
    Matched = UP;
end
if (I == 2)
    Matched = DOWN;
end
if (I == 3)
    Matched = RIGHT;
end
if (I == 4)
    Matched = LEFT;
end
t(1) = I;

 

 

Junior Jessy


@Junior_jessy,

Well, let's start with those items that will be constant.  Constant values should be declared as generics, constant vectors should be placed in some kind of RAM--either block RAM or SDRAM depending upon the size.  How you then interact with this data will be very different depending upon which you choose.

I notice that you are comparing your data against a known pattern, but looking for the maximum difference in (signal - template).  Have you considered the reality that the signal of interest might have a different amplitude?  That the voice speaking it might be using a different pitch or a different cadence?  Testing against ad-hoc recorded signals (not your templates) will help to illustrate the problems with your current method.  Ask your significant other, for example, to say some of the words.  Then you say it.  Then see which matches.

It looks like you might be doing a correlation above.  (A correlation isn't recommended ... you aren't likely to be successful with it--see above.)  If you find that you need to implement a correlation, your method above won't accomplish one very well.  Your approach above is more appropriate for attempting to match a single vector, not for matching an ongoing stream of data.  Correlations against a stream of data are often done via an FFT, a point-by-conjugate-point multiplication, followed by an inverse FFT.  If you do one FFT of the input, and then three IFFT's depending upon which sequence you are correlating with, you can save yourself some FPGA resources.  Be aware of circular convolution issues when using the FFT--you'll need to understand and deal with those.  Once done, looking for the maximum in a stream of data is fairly simple.  This change should be made in Octave still.
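(A bare-bones Octave sketch of that FFT-based correlation is below; x and t are placeholder names, and the zero-padding is what keeps the circular-convolution wrap-around from biting you.)

% FFT-based cross-correlation of an input block x against one template t.
x = randn(4096, 1);                         % input block (placeholder)
t = randn(512, 1);                          % one template (placeholder)
N = 2^nextpow2(length(x) + length(t) - 1);  % pad to avoid circular wrap-around
X = fft(x, N);
T = fft(t, N);
xc = ifft(X .* conj(T));                    % point-by-conjugate-point multiply, then IFFT
[peak, lag] = max(abs(xc(1:length(x))));    % the peak gives the best alignment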

All that said, your algorithm is not yet robust enough to work on real speech.  Looks like you haven't tried it on ad-hoc recorded speech yet, instead trying it on your small recorded subset.  Record an ad-hoc subset and see what happens.  I think you'll be surprised and a bit disappointed for all the reasons I've discussed above.

Have you seen my article on pipeline strategies?  It would help you to be able to build some pipeline logic here when you are finally ready to move to VHDL.  (You aren't ready for the move yet)

How about my example design that reads from an A/D, downsamples and filters the input, FFT's the result, and then plots the result on a simulated VGA screen?  You might find it instructive as you start to think about how to go about this task.  (You aren't ready for this yet either--but it might be a good design to review.)

Realistically, though, the bottom line is that you really have some more work to do before moving to VHDL.  In particular, you need to test your algorithm against ad-hoc sampled data separate from your training data.  After that, and after you fix the bugs you'll discover by doing so, you'll have a better chance of success once you do move to VHDL.

Dan


Hi @zygot @D@n 

Thanks for your responses.

I have done what you advised and now I want to translate it into VHDL. I don't know where to start, because I have only written simple VHDL code such as counters, multiplexers, RAM memories, and a transcoder for a 7-segment display. (I once coded an MP3 player.)

However, I don't know where to start or which methodology I need to use to translate this:

 

 

% dynamic time warping of two signals

function totalDistance = DTW(signal, template)

signalSize   = size(signal, 1);
templateSize = size(template, 1);
if signalSize > 2*templateSize || templateSize > 2*signalSize
    error('Error in dtw(): the length difference between the signals is very large.');
end

%% initialization
distances = zeros(signalSize, templateSize) + Inf; % cache matrix

distances(1,1) = norm(signal(1,:) - template(1,:));

% initialize the first row
for j = 2:templateSize
    cost = norm(signal(1,:) - template(j,:));
    distances(1,j) = distances(1,j-1) + cost;
end

% initialize the first column
for i = 2:signalSize
    cost = norm(signal(i,:) - template(1,:));
    distances(i,1) = cost + distances(i-1,1);
end

% initialize the second row
for j = 2:templateSize
    cost = norm(signal(2,:) - template(j,:));
    % distances(i,j) = cost + min([      LEFT      ,      CORNER     ]);
    distances(2,j)   = cost + min([distances(2,j-1), distances(1,j-1)]);
end

%% begin dynamic programming
for i = 3:signalSize
    for j = 2:templateSize
        cost = norm(signal(i,:) - template(j,:));
        % distances(i,j) = cost + min([ LEFT           , CORNER            , FAR_CORNER        ]);
        distances(i,j)   = cost + min([distances(i,j-1), distances(i-1,j-1), distances(i-2,j-1)]);
    end
end

totalDistance = distances(signalSize, templateSize);
end

 

 

 

Thank you in advance.


@Junior_jessy,

Please allow me to add a word or two to @zygot's advice.

In particular, you've done your testing (so far) on recorded speech.  Moving from recorded to live speech is a testing challenge (nightmare?).  You'll want to be able to not only process the live speech, but also to be able (just after the fact) to grab recordings of any live speech that didn't process as you wanted it to, so that you can place these recordings into your MATLAB framework and see what went right (or wrong) about them.  It is possible to use your computer microphone audio for this.  You'll learn a lot from host audio, just not enough.

Since @zygot mentioned them, I'll admit to having worked with the MATLAB simulink to HDL tools before.  I would not recommend them to anyone who is considering them.  They are great for graphically designing code, horrible when trying to do a diff to see if anything changed and horrible when digging for the details within a design.  Likewise I had bad experiences trying to find all the top level ports and internal comments.  Finally, once I stopped paying for the license (and swapped laptops), the code was lost to me.  Not something I'd recommend.

One other point: @zygot said "Simulate, simulate and simulate."  I agree completely.  Were I a professor, I would even stomp my foot at this point.  It will be on the final exam.  Simulate the A/D, simulate your algorithm, simulate both together!  Create a simulation application, using your favorite HDL utility, that will allow you to input any audio file and test it.  Examine what happens if the audio is earlier or later, stronger or weaker, faster or slower, etc.  Examine up to two audio files in succession in this manner!  Don't stop at one.  (I've found a lot of bugs in the dead period between audios, where the algorithm is recycling.)  In addition, I'd also suggest formally verifying whatever you can.  (The A/D is a known candidate for formal verification ...)  I've been caught multiple times over by bugs I'd never find in simulation that I then find with formal verification, so I highly recommend that to you.  ( @zygot if you want to dig into this further, I'd be glad to, but let's start another topic for it if you are so inclined.)  The most recent example?  A UART transmitter that couldn't handle a transmission in 10*baud clocks, but rather 10*baud clocks + 1.

Just my two cents,

Dan


In theory, if your FPGA has a processor running some version of Linux, you might be able to run OCTAVE on your target. I doubt that you'll be able to run your code in real time though.

The approach is this:

  1.  Design your algorithm in OCTAVE or MATLAB. You should freely use any high-level code words to quickly get to a workable, verifiable result
  2.  Re-design your working project using OCTAVE, but replacing all high-level code words with simple if..then..else or while.. code words. The idea here is to replicate the high-level OCTAVE with simpler constructs that you can understand well enough to translate into data structures, HDL structures, and processes. This is the hard part. You will have to deal with any fixed point or block floating point that you require. You will have to design for bit growth, rounding, and all of the low-level stuff that the high-level OCTAVE takes care of for you (a small fixed-point sketch follows this list). Did I mention that this is the hard part? Once you've done this you can design your basic HDL approach. It takes some insight to know how to implement complicated mathematical constructs using just basic FPGA resources.
  3. Design an FPGA implementation that utilizes the resources of your device.
  4. Simulate, simulate and simulate before trying to create a configuration file.
  5. Run the design in hardware. Verify that the results are close enough to your second OCTAVE implementation.
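The fixed-point sketch mentioned in step 2: a few Octave lines showing quantization, saturation, and the word-length growth of a multiply. The 12-bit width is only an assumption chosen to match a 12-bit ADC, not a recommendation for this particular design.

% Fixed-point sketch: quantize to signed 12 bits and watch a multiply grow.
fs = 16000;
x  = 0.5 * sin(2*pi*440*(0:fs-1)'/fs);     % placeholder audio in [-1, 1)
B  = 12;                                   % assumed word length (matches a 12-bit ADC)
xq = round(x * 2^(B-1));                   % quantize: integers in [-2048, 2047]
xq = max(min(xq, 2^(B-1)-1), -2^(B-1));    % saturate instead of wrapping around
c  = round(0.8 * 2^(B-1));                 % a filter coefficient in the same format
y  = xq * c;                               % the product roughly doubles the word length
y  = round(y / 2^(B-1));                   % round/rescale back down to B bits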

There are no short-cuts that I know of. This is how it's done everywhere that I've worked. This is how I do it.

For those who are well-heeled, Mathworks will sell you tools to ease the transition from high-level MATLAB to almost unreadable HDL ... but they are very expensive and there are no guarantees that the tools will support your requirements. For your application they probably will be helpful. The bad part is that not only do you have to pay Mathworks, but you will also have to pay your FPGA vendor for tools support.


Hi @D@n @zygot @xc6lx45

Happy new year  ☺️ .  

Getting back to my project:

I'm trying to move the MATLAB code to VHDL and I'm using HDL Coder. However, I have some issues that I can't fix. I learned that I have to modify the MATLAB code, but in my case I don't know how to do it. I would be grateful if you could help me.

Because I'm running out of time, is there a way to run MATLAB code on an FPGA without turning it into VHDL?

I attached my code to this message.

Thank you in advance for your responses.

test.zip


5 minutes ago, D@n said:

I'd be boneheaded enough to do it in C++ personally

Since VHDL and Verilog are both languages created for simulation, you could, in theory, prototype your algorithm in either of these. That might properly be considered bone-headed....


@Junior_jessy,

Ditto @xc6lx45's advice.

I've spent decades of my life working on signal processing issues, including voice processing.  The basic rules were: 1) get it running off-line, and then 2) get it running within whatever special-purpose hardware (microcontroller, FPGA, etc.) is required.  This allows us to debug the algorithm where the debugging is easy (Matlab or Octave), so that we'd only need to debug the hardware implementation afterwards.  As an added benefit, you could send the same samples through both designs (external software, online hardware) and look for differences, which you could then identify as bugs.
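(A tiny Octave sketch of that "same samples through both designs" check; the file names and the reference filter here are stand-ins, not your actual chain.)

% Run a shared recording through the Octave reference and diff it against the
% captured hardware (or simulation) output, sample by sample.
[x, fs] = audioread('test_word.wav');        % shared input recording (placeholder name)
x       = x(:, 1);                           % keep one channel
ref     = filter([1 -0.97], 1, x);           % stand-in for the Octave reference processing
hw      = load('hardware_output.txt');       % captured FPGA/simulation output (placeholder)
err     = max(abs(ref(1:length(hw)) - hw(:)));
fprintf('max sample difference: %g\n', err); % more than a few LSBs means a bug somewhere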

Trust me, this would be the fastest way to get your design working in VHDL.  If you run to the FPGA too fast, you'll 1) spend hours (days or weeks even!) debugging your algorithm, and 2) cement parts of your algorithm before you know that they are working, resulting in more rework time.

Now, let's discuss your voice cross-correlation approach: It won't work.  Here's why:

Voice has pitch.  You can think of the "pitch" as a fundamental frequency of which there are many harmonics.  Pitch is not a constant; it is a varying function.  The same word can be said many different ways, while the pitch just subtly shifts around.  A cross-correlation will force you to match pitch exactly, which it will never do.

That's problem one.

Problem two: Vocal cadence.  You can say the word "Hello" at many different speeds and it will still be the same word.  Hence, not only does your comparison need to stretch or shrink in frequency to accommodate pitch, it also needs to stretch or shrink in time to accommodate cadence.

That's problem two.

Problem three: Your mouth will shape the sound you make based upon the position of your jaw and your tongue (and probably a bit more).  This acts upon the voice as a filter in frequency that doesn't scale with pitch.  That is, as the pitch goes from deep bass to upper treble, the same mouth and tongue shape will filter the sound the same way.  (This assumes you could say the same word twice and get the *same* mouth shape.)

That's problem three.

Problem four: Sounds composed of fundamentals with harmonics tend to do a number on cross-correlation approaches.  Specifically, I've gotten a lot of false alarms using cross-correlations which, upon investigation, had nothing to do with what I was trying to correlate for.  A flute (or other instrument), for example, might give a strong cross-correlation score if you are not careful.

Four problems of this magnitude should be enough to suggest you should try your algorithm in Matlab or Octave (I'd be boneheaded enough to do it in C++ personally) before jumping to the FPGA.  Computers today have enough horsepower on them to even do this task in real-time, so that you don't need an FPGA for the task.  (FPGA's are still fun, though, and I'd be tempted to implement the result in an FPGA anyway.)

Were I you, having never worked with speech before, I'd start out not with the FPGA but rather with a spectral raster of frequency over time.  I'm partial to a Hann window, and a 50% overlap (or more) is required, not optional, without incurring the wrath of Nyquist.  FFT lengths of about 20-50 ms are usually good choices for working with voice and seeing what's going on within.
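(Here's what I mean by a spectral raster, as a short Octave sketch; the 16 kHz rate and the file name are assumptions, and 512 samples at 16 kHz gives a 32 ms frame, which sits inside the 20-50 ms range above.)

% Spectral raster: Hann-windowed frames, 50% overlap, magnitude in dB.
[x, fs] = audioread('gauchero2.wav');      % any of the recordings (placeholder name)
x   = x(:, 1);                             % keep one channel
N   = 512;                                 % about 32 ms at 16 kHz
hop = N / 2;                               % 50% overlap
w   = 0.5 - 0.5 * cos(2*pi*(0:N-1)'/N);    % Hann window, computed directly
nframes = floor((length(x) - N) / hop) + 1;
S = zeros(N/2 + 1, nframes);
for k = 1:nframes
    seg = x((k-1)*hop + 1 : (k-1)*hop + N) .* w;
    F   = fft(seg);
    S(:, k) = abs(F(1:N/2 + 1));           % keep the non-negative frequencies
end
imagesc(20*log10(S + eps)); axis xy;       % time across, frequency up, in dB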

Then, when returning to the FPGA, I would simulate *EVERYTHING* before touching your actual hardware.  Make recordings while working with Octave, prove your algorithm in Octave on those recordings, then feed those recordings into your simulation to prove that the simulation works.  Only at that point would I ever approach hardware.

Oh, and ... I'd also formally verify everything too before moving to hardware.  Once formally verified, it's easy to make changes to your implementation and then re-verify that they will do what you want.  You might need this if you get to the hardware only to find you need to shuffle logic from one clock tick to another because you aren't meeting timing.  In that case, re-verifying what you are doing would be quite valuable.

Those are just some things to think about,

Dan


1 hour ago, Junior_jessy said:

I have to describe my architecture in VHDL; I'm not coding in MATLAB.

The advice from @xc6lx45 is the best you will get. So you have your logic structure that you hope will allow you to accomplish your goal. You've started at the wrong step. What are the measurement criteria by which you intend to 'match signals', as you see it? The place to start is to test your concept in OCTAVE or MATLAB to prove that your algorithm works. This solution is not specific to any hardware. Once you get the desired results in OCTAVE simulation, you then design the hardware to replicate it. Recognizing human phrases is a lot more complicated than applying one 'magic' mathematical technique. OCTAVE will let you use an audio file or microphone input to figure this out for yourself. This is one project where your simulation software (OCTAVE) can reproduce hardware results very accurately. If you can't make the OCTAVE version work, then you don't have the understanding needed to make your concept work in hardware. If it works in simulation but not in hardware, then you know where to focus your energy.


>> Will that solution work?

I suspect you mean cross-correlation, and no, it will most likely not work.

Maybe you'll save yourself much pain if you prototype the algorithm first in software. It doesn't need to be real time.
E.g., get the freeware Octave and use the audioread() function. Be sure to use two independent recordings for the reference and the simulated microphone input.
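(Something as small as this is enough to get started; the file names are placeholders.)

% Off-line test harness: one recording as the stored reference, a second,
% independent recording of the same word as the simulated microphone input.
[ref, fs1] = audioread('gauche_reference.wav');   % template take (placeholder name)
[mic, fs2] = audioread('gauche_test.wav');        % independent take (placeholder name)
assert(fs1 == fs2, 'recordings must share a sample rate');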

 


Hi @jpeyron

Thanks for your response.

Let me remind you that my project's aim is to make a voice-activated system with the Pmod MIC3.

The solution I found is to make a memory holding the different words that I have to recognize.

And then, when the user speaks, I compute an autocorrelation of his voice signal with what I recorded on the board.

Will that solution work?

If yes, I need to make a memory to record the 12-bit outputs at the same time as they are generated, with an acquisition time of 1 second.

Thanks again for your response. 

 Junior 


Hi @jpeyron

Yes, I'm talking about the project done by @hamster. I think it's just the noise that turns on the first three LEDs, because I was not talking in front of the microphone. In that case, how can I reduce the noise's power and do the signal processing?

Thank you,

Junior


Hi @Junior_jessy,

Are you referring to the VHDL Pmod MIC3 project linked above, done by @hamster?  What range of acoustic sound were you testing? The Pmod MIC3 has a Knowles Acoustics SPA2410LR5H-B MEMS microphone and a Texas Instruments ADCS7476 ADC. I would guess the LEDs are staying the same due to the acoustic range you are testing.

thank you,

Jon


Hi @jpeyron,

Thanks for  your response, 

I tried the example and I have some questions, please.

When I displayed the 12 bits of the MIC3's output on the board's LEDs, I noticed that the 11th bit is always ON, and the same goes for the first three bits, and I don't understand why.

The reason I want to know is that I have to do signal processing on the MIC3's output in order to recognize a word.

 

Thank you in advance for your answer. 

Junior


Hi @Junior_jessy,

Here is a VHDL project for the Pmod MIC3 using the Basys 3, done by community member @hamster.  I would suggest using the Basys 3's XDC as a reference for the XDC of the Nexys 4. Please attach your XDC and HDL code. We have not had time to create an IP core for the Pmod MIC3. If you want to use MicroBlaze and an IP core, you can use the Pmod DPG1 and alter the SDK code to interact with the Pmod MIC3.  You can download the Vivado library here.  Here is a tutorial on using the Digilent Pmod IP cores. You will need to use the Digilent board files. Here is a tutorial on how to install the board files.

