python - MFCC and delta coefficients in 3 Python libraries

Tags: python audio speech-recognition librosa

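I have recently been working on an assignment about MFCCs, and I cannot work out some of the differences between the libraries I am using.

The 3 libraries I use are: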

python_speech_features

SpeechPy

LibROSA

samplerate = 16000
NFFT = 512
NCEPT = 13

Part 1: Mel filter bank
temp1_fb = pyspeech.get_filterbanks(nfilt=NFILT, nfft=NFFT, samplerate=sample1)
# speechpy does not reduce the FFT length to NFFT // 2 + 1 bins when building the filters
temp2_fb = speechpy.feature.filterbanks(num_filter=NFILT, fftpoints=NFFT, sampling_freq=sample1)
temp3_fb = librosa.filters.mel(sr=sample1, n_fft=NFFT, n_mels=NFILT)
# undo librosa's per-filter normalization so that every filter peaks at 1
temp3_fb /= np.max(temp3_fb, axis=-1)[:, None]

pic1

Only speechpy returns a filter bank whose second dimension is 512; the other two return 257 (NFFT // 2 + 1). The librosa filters also look slightly distorted in the plot.
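To make the 257 versus 512 question concrete, here is a minimal NumPy-only sketch of a triangular Mel filter bank built on the one-sided spectrum. It is not the code of any of the three libraries: the function names are hypothetical, NFILT = 26 is an assumed value (the question does not state it), and the area normalization shown at the end is only an approximation of what a normalized filter bank such as librosa's default looks like.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def triangular_filterbank(nfilt, nfft, samplerate):
    """Triangular Mel filters over the one-sided spectrum, each peaking at 1."""
    n_bins = nfft // 2 + 1  # a real FFT of nfft points keeps only this many bins
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(samplerate / 2.0), nfilt + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((nfft + 1) * hz_points / samplerate).astype(int)
    fbank = np.zeros((nfilt, n_bins))
    for j in range(nfilt):
        left, center, right = bins[j], bins[j + 1], bins[j + 2]
        for i in range(left, center):
            fbank[j, i] = (i - left) / (center - left)   # rising edge
        for i in range(center, right):
            fbank[j, i] = (right - i) / (right - center)  # falling edge, 1 at center
    return fbank, hz_points

NFILT = 26  # assumed value; the question does not state NFILT
fb, hz_points = triangular_filterbank(NFILT, nfft=512, samplerate=16000)
print(fb.shape)  # (26, 257)

# Area normalization: scale each filter by 2 / (its bandwidth in Hz).
area_norm = fb * (2.0 / (hz_points[2:] - hz_points[:-2]))[:, None]
# Dividing by the per-filter maximum brings every peak back to 1.
restored = area_norm / np.max(area_norm, axis=-1)[:, None]
print(np.allclose(restored, fb))  # True
```

A filter bank with 512 columns cannot be multiplied with a 257-bin power spectrum, so the two conventions have to be reconciled before the outputs can be compared.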



Part 2: MFCC
# pyspeech: liftering disabled (ceplifter=0), Hamming window
temp1_mfcc = pyspeech.mfcc(speaker1, samplerate=sample1, winlen=0.025, winstep=0.01, numcep=NCEPT, nfilt=NFILT, nfft=NFFT,
                           preemph=0.97, ceplifter=0, winfunc=np.hamming, appendEnergy=False)
# speechpy: needs an already pre-emphasized signal; its window is fixed to rectangular and its Mel filter bank differs
temp2_mfcc = speechpy.feature.mfcc(emphasized_speaker1, sampling_frequency=sample1, frame_length=0.025, frame_stride=0.01,
                                   num_cepstral=NCEPT, num_filters=NFILT, fft_length=NFFT)
# librosa: needs an already pre-emphasized signal and takes log energies; its STFT uses a Hann window and its framing differs
# (the signal is passed as y=..., which recent librosa versions require)
temp3_energy = librosa.feature.melspectrogram(y=emphasized_speaker1, sr=sample1, S=temp3_pow.T, n_fft=NFFT,
                                              hop_length=frame_step, n_mels=NFILT).T
temp3_energy = np.log(temp3_energy)
temp3_mfcc = librosa.feature.mfcc(y=emphasized_speaker1, sr=sample1, S=temp3_energy.T, n_mfcc=13, dct_type=2, n_fft=NFFT,
                                  hop_length=frame_step).T

pic2

I tried to make the conditions as fair as possible. Even so, the speechpy plot comes out darker.
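To see how much a single choice such as the window function moves the result, the pipeline can be sketched in plain NumPy. This is an illustration under assumptions, not a reimplementation of any of the three libraries: all function names are hypothetical, NFILT = 26 is assumed, the framing simply drops the trailing partial frame (real libraries pad or center differently), and white noise stands in for the speech signal.

```python
import numpy as np

def mel_filterbank(nfilt, nfft, samplerate):
    """Peak-1 triangular Mel filters, shape (nfilt, nfft // 2 + 1)."""
    to_mel = lambda hz: 2595.0 * np.log10(1.0 + hz / 700.0)
    to_hz = lambda mel: 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    hz = to_hz(np.linspace(to_mel(0.0), to_mel(samplerate / 2.0), nfilt + 2))
    bins = np.floor((nfft + 1) * hz / samplerate).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for j in range(nfilt):
        lo, mid, hi = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[j, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    return fbank

def preemphasis(signal, coeff=0.97):
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_signal(signal, frame_len, frame_step, winfunc):
    # Simplification: the trailing partial frame is dropped, not padded.
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    idx = np.arange(frame_len)[None, :] + frame_step * np.arange(n_frames)[:, None]
    return signal[idx] * winfunc(frame_len)

def dct2_ortho(x, n_out):
    """Orthonormal DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    scale = np.full(n_out, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return (x @ basis.T) * scale

def mfcc_sketch(signal, fbank, nfft, frame_len, frame_step, winfunc, numcep=13):
    frames = frame_signal(preemphasis(signal), frame_len, frame_step, winfunc)
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft   # power spectrum per frame
    energies = power @ fbank.T                              # Mel filter bank energies
    energies = np.where(energies == 0, np.finfo(float).eps, energies)  # avoid log(0)
    return dct2_ortho(np.log(energies), numcep)

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)      # 1 s of noise at 16 kHz as a stand-in signal
fbank = mel_filterbank(26, 512, 16000)   # 26 filters is an assumed value
mfcc_hamming = mfcc_sketch(signal, fbank, 512, 400, 160, np.hamming)
mfcc_rect = mfcc_sketch(signal, fbank, 512, 400, 160, np.ones)
print(mfcc_hamming.shape)  # (98, 13)
print(np.abs(mfcc_hamming - mfcc_rect).mean())  # non-zero: the window alone changes the MFCCs
```

With everything else held equal, swapping the Hamming window for a rectangular one already shifts the coefficients, which is consistent with one library's plot looking darker than the others.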



Part 3: Delta coefficients
temp1 = pyspeech.delta(mfcc_speaker1, 2)
temp2 = speechpy.processing.derivative_extraction(mfcc_speaker1.T, 1).T
# librosa: differentiate along the frame axis (axis=0)
temp3 = librosa.feature.delta(mfcc_speaker1, width=5, axis=0, order=1)

pic3

I can't pass the MFCC matrix to speechpy directly, or the result looks very strange, which is why I transpose it first. The parameters also do not behave the way I expected.
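For reference, the usual regression formula for delta coefficients is d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2). Below is a minimal NumPy sketch of it, assuming frames along axis 0 and edge padding at the boundaries. The function name is hypothetical; as far as I know python_speech_features follows this formula, while librosa fits a local polynomial over `width` frames, so `width=5` roughly corresponds to N=2 away from the edges.

```python
import numpy as np

def delta_sketch(feat, N=2):
    """Regression-based deltas over a window of 2*N + 1 frames (frames on axis 0)."""
    n_frames = feat.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))
    # Repeat the first and last frame N times so every frame has N neighbours.
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")
    out = np.zeros_like(feat, dtype=float)
    for n in range(1, N + 1):
        ahead = padded[N + n : N + n + n_frames]    # c_{t+n}
        behind = padded[N - n : N - n + n_frames]   # c_{t-n}
        out += n * (ahead - behind)
    return out / denom

# A feature that grows by 3 per frame should have a delta of 3 away from the edges.
feat = 3.0 * np.arange(10)[:, None] * np.ones((1, 13))
d = delta_sketch(feat, N=2)
print(d[2:-2, 0])  # [3. 3. 3. 3. 3. 3.]
print(d[0, 0])     # 1.5, because edge padding flattens the slope at the boundary
```

The interior values agree with the true slope, but the first and last N frames depend on how the boundary is handled, which is one place where libraries can legitimately disagree.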



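I would like to know what causes these differences. Is it only the things I mentioned above, or have I made some mistakes? A detailed answer would be appreciated, thanks.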

Best answer

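There are many MFCC implementations, and they often differ in small details: the shape of the window function, how the Mel filter bank is computed, and the DCT can all differ. It is hard to find two libraries that are fully compatible. In the long run this should not matter to you, as long as you use the same implementation everywhere, for example for both training and test data. The differences do not affect the results.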

This question, "python - MFCC and delta coefficients in 3 Python libraries", corresponds to a similar question on Stack Overflow: https://stackoverflow.com/questions/50924493/
