Kaldi FBank Algorithm: Reading Notes
March 20, 2020, 8:09 p.m.
Reading process
make_fbank.sh
$cmd JOB=1:$nj $logdir/make_fbank_${name}.JOB.log \
  extract-segments scp,p:$scp $logdir/segments.JOB ark:- \| \
  compute-fbank-feats $vtln_opts $write_utt2dur_opt --verbose=2 \
    --config=$fbank_config ark:- ark:- \| \
  copy-feats --compress=$compress $write_num_frames_opt ark:- \
    ark,scp:$fbankdir/raw_fbank_$name.JOB.ark,$fbankdir/raw_fbank_$name.JOB.scp \
  || exit 1;
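The script calls compute-fbank-feats, which is implemented in compute-fbank-feats.cc.
This program only reads the raw audio and wraps the call into the feature computation; it does little else.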
int32 num_utts = 0, num_success = 0;
for (; !reader.Done(); reader.Next()) {
  num_utts++;
  std::string utt = reader.Key();
  // Read the wave data.
  const WaveData &wave_data = reader.Value();
  if (wave_data.Duration() < min_duration) {
    KALDI_WARN << "File: " << utt << " is too short ("
               << wave_data.Duration() << " sec): producing no output.";
    continue;
  }
  // Get the number of channels.
  int32 num_chan = wave_data.Data().NumRows(), this_chan = channel;
  {  // This block works out the channel (0=left, 1=right...)
    KALDI_ASSERT(num_chan > 0);  // This should have been caught in
    // reading code if no channels.
    if (channel == -1) {
      this_chan = 0;
      if (num_chan != 1)
        KALDI_WARN << "Channel not specified but you have data with "
                   << num_chan << " channels; defaulting to zero";
    } else {
      if (this_chan >= num_chan) {
        KALDI_WARN << "File with id " << utt << " has "
                   << num_chan << " channels but you specified channel "
                   << channel << ", producing no output.";
        continue;
      }
    }
  }
  BaseFloat vtln_warp_local;  // Work out VTLN warp factor.
  if (vtln_map_rspecifier != "") {
    if (!vtln_map_reader.HasKey(utt)) {
      KALDI_WARN << "No vtln-map entry for utterance-id (or speaker-id) "
                 << utt;
      continue;
    }
    vtln_warp_local = vtln_map_reader.Value(utt);
  } else {
    vtln_warp_local = vtln_warp;
  }
  // KALDI_WARN << "vtln_warp_local ==> " << wav_rspecifier << "\n";
  // View the selected channel's samples (read from int16 data) as a vector.
  SubVector<BaseFloat> waveform(wave_data.Data(), this_chan);
#ifdef HUPENG_DEBUG
  KALDI_WARN << "waveform " << waveform.Dim() << waveform << "\n";
#endif
  Matrix<BaseFloat> features;
  try {
    // ComputeFeatures is implemented in the parent class, in
    // feature-common-inl.h, and calls back into the subclass's methods.
    // If the sample rate equals the target sample rate, feature
    // extraction starts directly.
    // If the two differ, the --allow-downsample / --allow-upsample
    // options decide whether the waveform is down- or up-sampled.
    // Next stop: feature-common-inl.h.
    fbank.ComputeFeatures(waveform, wave_data.SampFreq(),
                          vtln_warp_local, &features);
  } catch (...) {
    KALDI_WARN << "Failed to compute features for utterance " << utt;
    continue;
  }
  // ... (rest of the loop body omitted in this excerpt)
}
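The sample-rate handling described in the comments above can be sketched roughly as follows. This is a Python paraphrase, not Kaldi code: plan_feature_extraction and its return strings are hypothetical names used only for illustration, and treating a disallowed resample as an error (which the catch block above would then report as a failed utterance) is my reading of the behaviour.

```python
def plan_feature_extraction(wave_freq, target_freq,
                            allow_downsample=False, allow_upsample=False):
    """Decide what happens to a waveform before feature extraction.

    Hypothetical helper: returns "compute", "downsample" or "upsample",
    or raises ValueError when the required resampling is not allowed.
    """
    if wave_freq == target_freq:
        # Rates match: features are computed on the waveform as-is.
        return "compute"
    if target_freq < wave_freq:
        if not allow_downsample:
            raise ValueError("sample rate mismatch; downsampling not allowed")
        return "downsample"
    if not allow_upsample:
        raise ValueError("sample rate mismatch; upsampling not allowed")
    return "upsample"
```

With both switches left off, any mismatch ends in an error, which matches the experience that a mixed-rate data set needs --allow-downsample=true to go through.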
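The windowing is driven from feature-common-inl.h.
Several window types are available; one is selected with the --window-type option.
The windowing function is analysed next.
It performs framing and windowing together. The interesting part is how it handles a --frame-length whose sample count is not a power of two.
Reading the code shows that dithering (Gaussian noise), mean removal, pre-emphasis and windowing all operate on the actual frame length rather than the power-of-two length. The samples between the actual length and the power-of-two length are simply set to zero, which is a sensible design.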
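The zero-filling described above can be sketched in a few lines. This is an illustrative Python sketch, not Kaldi code: padded_window_size, is_power_of_two and pad_frame are hypothetical helper names, and the sketch assumes that rounding up to a power of two is enabled.

```python
def padded_window_size(window_size):
    """Smallest power of two that is >= window_size."""
    n = 1
    while n < window_size:
        n *= 2
    return n


def is_power_of_two(n):
    """True when n is a positive power of two (n & (n - 1) == 0)."""
    return n > 0 and (n & (n - 1)) == 0


def pad_frame(frame):
    """Keep the real samples untouched and append zeros up to the padded size."""
    padded = padded_window_size(len(frame))
    return list(frame) + [0.0] * (padded - len(frame))
```

For a 400-sample frame the padded size is 512, so all per-frame processing touches samples 0..399 and samples 400..511 are zeros.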
The windowing function lives in feature-window.cc:
void ExtractWindow(int64 sample_offset,
                   const VectorBase<BaseFloat> &wave,
                   int32 f,  // with 0 <= f < NumFrames(feats, opts)
                   const FrameExtractionOptions &opts,
                   const FeatureWindowFunction &window_function,
                   Vector<BaseFloat> *window,
                   BaseFloat *log_energy_pre_window) {
  KALDI_ASSERT(sample_offset >= 0 && wave.Dim() != 0);
  int32 frame_length = opts.WindowSize(),
      frame_length_padded = opts.PaddedWindowSize();
  int64 num_samples = sample_offset + wave.Dim(),
      start_sample = FirstSampleOfFrame(f, opts),
      end_sample = start_sample + frame_length;

  if (opts.snip_edges) {
    KALDI_ASSERT(start_sample >= sample_offset &&
                 end_sample <= num_samples);
  } else {
    KALDI_ASSERT(sample_offset == 0 || start_sample >= sample_offset);
  }

  if (window->Dim() != frame_length_padded)
    window->Resize(frame_length_padded, kUndefined);

  // wave_start and wave_end are start and end indexes into 'wave', for the
  // piece of wave that we're trying to extract.
  int32 wave_start = int32(start_sample - sample_offset),
      wave_end = wave_start + frame_length;
  if (wave_start >= 0 && wave_end <= wave.Dim()) {
    // the normal case-- no edge effects to consider.
    window->Range(0, frame_length).CopyFromVec(
        wave.Range(wave_start, frame_length));
  } else {
    // Deal with any end effects by reflection, if needed.  This code will only
    // be reached for about two frames per utterance, so we don't concern
    // ourselves excessively with efficiency.
    int32 wave_dim = wave.Dim();
    for (int32 s = 0; s < frame_length; s++) {
      int32 s_in_wave = s + wave_start;
      while (s_in_wave < 0 || s_in_wave >= wave_dim) {
        // reflect around the beginning or end of the wave.
        // e.g. -1 -> 0, -2 -> 1.
        // dim -> dim - 1, dim + 1 -> dim - 2.
        // the code supports repeated reflections, although this
        // would only be needed in pathological cases.
        if (s_in_wave < 0) s_in_wave = - s_in_wave - 1;
        else s_in_wave = 2 * wave_dim - 1 - s_in_wave;
      }
      (*window)(s) = wave(s_in_wave);
    }
  }

  if (frame_length_padded > frame_length)
    window->Range(frame_length, frame_length_padded - frame_length).SetZero();

  SubVector<BaseFloat> frame(*window, 0, frame_length);

  ProcessWindow(opts, window_function, &frame, log_energy_pre_window);
}
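The edge handling by reflection in ExtractWindow is easier to follow in isolation. The sketch below restates only the index arithmetic of the while loop in Python; reflect_index and extract_frame are hypothetical names, and nothing else of the function (padding, ProcessWindow) is reproduced.

```python
def reflect_index(s, wave_dim):
    """Map an out-of-range sample index back into [0, wave_dim) by reflection.

    -1 -> 0, -2 -> 1, wave_dim -> wave_dim - 1, wave_dim + 1 -> wave_dim - 2.
    The loop repeats so that indices far outside the range are handled too.
    """
    while s < 0 or s >= wave_dim:
        if s < 0:
            s = -s - 1
        else:
            s = 2 * wave_dim - 1 - s
    return s


def extract_frame(wave, start, frame_length):
    """Copy frame_length samples starting at 'start', reflecting at the edges."""
    return [wave[reflect_index(start + s, len(wave))]
            for s in range(frame_length)]
```

A frame that starts two samples before the waveform therefore begins with a mirrored copy of the first two samples rather than with zeros.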
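The main thing to watch in this windowing code is the use_raw_log_energy flag, which reaches this function as the log_energy_pre_window pointer and decides whether the log energy is taken before pre-emphasis and windowing. Without that requirement the implementation is fairly ordinary.
The body of ExtractWindow performs the framing. The windowing itself is delegated to ProcessWindow, so that function is examined next: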
void ProcessWindow(const FrameExtractionOptions &opts,
                   const FeatureWindowFunction &window_function,
                   VectorBase<BaseFloat> *window,
                   BaseFloat *log_energy_pre_window) {
  int32 frame_length = opts.WindowSize();
  KALDI_ASSERT(window->Dim() == frame_length);

  if (opts.dither != 0.0)
    Dither(window, opts.dither);

  if (opts.remove_dc_offset)
    window->Add(-window->Sum() / frame_length);

  if (log_energy_pre_window != NULL) {
    BaseFloat energy = std::max<BaseFloat>(VecVec(*window, *window),
                                           std::numeric_limits<float>::epsilon());
    *log_energy_pre_window = Log(energy);
  }

  if (opts.preemph_coeff != 0.0)
    Preemphasize(window, opts.preemph_coeff);

  window->MulElements(window_function.window);
}
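ProcessWindow does the following, in this order:
- adds Gaussian noise (dithering)
- removes the mean (DC offset)
- computes the energy (not covered here)
- applies pre-emphasis
- applies the window function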
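The steps listed above can be sketched in plain Python. This is a rough approximation under stated assumptions, not a port: process_window and povey_window are hypothetical names, the energy step is left out, Python's random module stands in for Kaldi's random number generator, and the window formula (a Hann-like shape raised to the power 0.85) and the special handling of the first sample in pre-emphasis are written from my reading of the Kaldi source.

```python
import math
import random


def povey_window(n):
    """Hann-like window raised to the power 0.85; it is zero at both ends."""
    a = 2.0 * math.pi / (n - 1)
    return [(0.5 - 0.5 * math.cos(a * i)) ** 0.85 for i in range(n)]


def process_window(frame, dither=0.0, remove_dc_offset=True,
                   preemph_coeff=0.97, rng=None):
    """Dither, remove the mean, pre-emphasize, then apply the window."""
    frame = list(frame)
    n = len(frame)
    if dither != 0.0:
        rng = rng or random.Random()
        # Gaussian noise scaled by the dither value.
        frame = [x + dither * rng.gauss(0.0, 1.0) for x in frame]
    if remove_dc_offset:
        mean = sum(frame) / n
        frame = [x - mean for x in frame]
    if preemph_coeff != 0.0:
        # Walk from the last sample to the first so that every update still
        # sees the original (not yet pre-emphasized) previous sample.
        for i in range(n - 1, 0, -1):
            frame[i] -= preemph_coeff * frame[i - 1]
        # The first sample has no predecessor, so it is scaled against itself.
        frame[0] -= preemph_coeff * frame[0]
    window = povey_window(n)
    return [x * w for x, w in zip(frame, window)]
```

Passing a seeded random.Random as rng makes the dithered output repeatable, which is handy when comparing two runs on the same audio.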
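The key point in feature-fbank.cc is how the FFT handles a frame length that is not a power of two: the frame is zero-padded up to the next power of two and the FFT is run on the padded frame. I cross-checked this against PyTorch and the results agree.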
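A minimal sketch of "pad with zeros, then transform" follows. It is not how Kaldi computes the spectrum: a naive O(n^2) DFT is swapped in for the real FFT so that the example needs only the standard library, and power_spectrum_zero_padded is a hypothetical name.

```python
import cmath


def power_spectrum_zero_padded(frame):
    """Power spectrum of a frame after zero-padding to a power of two.

    Returns n_fft / 2 + 1 values (bins 0 .. Nyquist). A naive DFT is used
    here purely for illustration; a real implementation would use an FFT.
    """
    n_fft = 1
    while n_fft < len(frame):
        n_fft *= 2
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    spectrum = []
    for k in range(n_fft // 2 + 1):
        acc = 0j
        for t, x in enumerate(padded):
            acc += x * cmath.exp(-2j * cmath.pi * k * t / n_fft)
        spectrum.append(abs(acc) ** 2)
    return spectrum
```

For a 400-sample frame this yields 257 bins, the same count an FFT of size 512 gives.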
To sum up, the overall call flow is:
compute-fbank-feats.cc --> (read the raw data, int16 format)
        |
        |
        V
feature-common-inl.h   --> (resampling, framing, Gaussian noise, mean removal, pre-emphasis, windowing)
        |
        |
        V
feature-fbank.cc       --> (FFT, power spectrum, mel filtering, log)
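For the mel-filtering stage at the end of this flow, the sketch below shows how filter centres can be placed evenly on the mel scale. It assumes the common formula mel = 1127 * ln(1 + f / 700), which is not quoted in these notes; hz_to_mel, mel_to_hz and mel_center_frequencies are hypothetical names, and the triangular weights and the final log are left out.

```python
import math


def hz_to_mel(freq_hz):
    """Hz to mel, assuming mel = 1127 * ln(1 + f / 700)."""
    return 1127.0 * math.log(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_center_frequencies(num_bins, low_freq, high_freq):
    """Centre frequencies (Hz) of num_bins filters evenly spaced in mel."""
    mel_low = hz_to_mel(low_freq)
    mel_high = hz_to_mel(high_freq)
    delta = (mel_high - mel_low) / (num_bins + 1)
    return [mel_to_hz(mel_low + (b + 1) * delta) for b in range(num_bins)]
```

Because the spacing is uniform in mel, the centres crowd together at low frequencies and spread out towards the high end.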
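In theory MFCC differs from FBank only by the additional DCT, and possibly first- and second-order deltas; everything else is the same.
That completes the analysis. My takeaways follow.
Takeaways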
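- 1. By default the samples read as short (int16) are not divided by 32768 when they are converted to float, so they keep the int16 value range.
- 2. By default the mean of each frame is subtracted from that frame, i.e. mean (DC offset) removal.
- 3. By default each frame is pre-emphasized: every sample has 97% of the previous sample subtracted from it, a[z] = a[z] - a[z-1] * 0.97, and the computation runs from the last sample to the first.
- 4. The default window function is the povey window. According to the author's comment in the code it is similar to the Hamming window, but only similar.
- 5. If the frame length is not a power of two, i.e. window_size & (window_size - 1) is non-zero, the FFT result differs from what PyTorch's built-in stft function returns. The author provides an FFT routine for non-power-of-two frame lengths, but unfortunately it cannot be reached from the code as it stands. Even after patching the code so that it is called, the result still does not match PyTorch's stft, which may be why that code path is unreachable. I therefore recommend a power-of-two frame length, for example a frame length of 512 with a frame shift of 256.
- 6. Gaussian noise is added by default, controlled by the dither option. As a result, even when features are extracted from the same audio, every run produces slightly different features.
- 7. One odd point: the default frame length is 25 ms (400 samples at 16 kHz) and the default frame shift is 10 ms (160 samples), whereas PyTorch's stft, when no hop is given, defaults to a quarter of the frame length (n_fft / 4). This does not match my earlier experience; presumably a frame shift of half the frame length is only required for SE (speech enhancement).
- 8. If the data set contains files whose sample rate is higher than the target rate, specifying --allow-downsample=true is enough to downsample them to the target rate. Upsampling files below the target rate is not recommended; if it really is needed, pass --allow-upsample=true.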
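The frame arithmetic behind items 5 and 7 can be checked with a short sketch. The helper names are hypothetical; the frame-count formula assumes edge snipping is on (only frames that fit entirely inside the waveform are counted), and default_stft_hop merely restates the n_fft / 4 default mentioned above.

```python
def samples_per_frame(duration_ms, sample_rate):
    """Number of samples in a frame of duration_ms milliseconds."""
    return sample_rate * duration_ms // 1000


def num_frames_snip_edges(num_samples, frame_length, frame_shift):
    """Frames that fit completely inside the waveform (no edge padding)."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_shift


def default_stft_hop(n_fft):
    """Hop size used when none is given: a quarter of the FFT size."""
    return n_fft // 4
```

At 16 kHz, one second of audio with a 400-sample frame and a 160-sample shift gives 98 frames under this formula, while a 512-sample frame with the quarter-size hop would advance 128 samples per frame.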
References
Kaldi: Kaldi
Kaldi source code
torch — PyTorch master documentation