python - librosa 0.8.0 | Vocal separation output works, but is sped up by 200%

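I am writing this script in Python to separate the vocals from a music track and write them to an audio file. I chose librosa as the library for this. Here is the code: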

import numpy as np
import librosa.display
import soundfile as sf
import matplotlib.pyplot as plt
import librosa

y, sr = librosa.load('testInput.mp3')
output_file_path = "testOutput.wav"

S_full, phase = librosa.magphase(librosa.stft(y))
S_filter = librosa.decompose.nn_filter(S_full,
                                       aggregate=np.median,
                                       metric='cosine',
                                       width=int(librosa.time_to_frames(2, sr=sr)))
S_filter = np.minimum(S_full, S_filter)
margin_i, margin_v = 2, 10
power = 2

mask_i = librosa.util.softmask(S_filter,
                               margin_i * (S_full - S_filter),
                               power=power)

mask_v = librosa.util.softmask(S_full - S_filter,
                               margin_v * S_filter,
                               power=power)

# Once we have the masks, simply multiply them with the input spectrum
# to separate the components
S_foreground = mask_v * S_full
S_background = mask_i * S_full
D_foreground = S_foreground * phase
y_foreground = librosa.istft(D_foreground)
sf.write(output_file_path, y_foreground, samplerate=44100, subtype='PCM_24')

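It works to some extent, but the output is sped up by 200%, which also raises the pitch of the voice considerably. Whatever the input is, the output sounds like Alvin and the Chipmunks. Does anyone know how to fix this, or what I am doing wrong?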

Best answer

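librosa.load resamples to 22050 Hz by default. The script then writes those 22050 Hz samples with samplerate=44100, so the file plays back at twice the speed and an octave higher. To keep the input's original sample rate, use librosa.load(..., sr=None). Note, however, that many parameters in librosa are tuned for 22050 Hz, such as the FFT length. In this example, that applies at least to stft and istft.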
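The effect of a header rate that does not match the data can be reproduced with the standard library alone. The following is a minimal sketch, not part of the original answer: the helper name `declared_duration` is hypothetical, and silence stands in for real audio because only the sample count and the header rate matter here.

```python
import io
import wave


def declared_duration(n_samples, header_rate):
    """Write n_samples of 16-bit mono silence with the given header rate and
    return the duration that the resulting WAV header implies."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)            # 16-bit PCM
        w.setframerate(header_rate)  # rate a player will trust
        w.writeframes(b"\x00\x00" * n_samples)
    buf.seek(0)
    with wave.open(buf, "rb") as r:
        return r.getnframes() / r.getframerate()


# One second of audio after resampling to 22050 Hz is 22050 samples.
n = 22050
correct = declared_duration(n, 22050)     # header matches the data
mismatched = declared_duration(n, 44100)  # header claims twice the real rate
speed_factor = correct / mismatched
print(correct, mismatched, speed_factor)
```

The same samples last 1.0 s with the matching header and 0.5 s with the 44100 Hz header, a speed factor of 2.0, which matches the 200% speed-up described in the question.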

So you may also want to try keeping that sample rate. Either way, it is best to pass samplerate=sr when calling sf.write() so that the rate is not hard-coded.

Regarding python - librosa 0.8.0 | Vocal separation output works, but is sped up by 200%, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/65391055/
