amazon-s3 - How to feed audio files from an S3 bucket directly to Google Speech-to-Text

Tags: amazon-s3 speech-recognition speech-to-text google-speech-api google-cloud-speech

We are building a voice application with Google's Speech-to-Text API. Our data (audio files) currently lives in an S3 bucket on AWS. Is there a way to pass an S3 URI directly to Google's Speech-to-Text API?

Judging from their documentation, this does not currently seem to be possible with Google's Speech-to-Text API.

This is not the case with their Vision and NLP APIs.

  1. Is there a known reason why the Speech API has this limitation?
  2. Is there a good workaround?

Best Answer

Currently, Google only accepts audio files from a local source or from Google Cloud Storage. The documentation gives no reasonable explanation for this.

Passing audio referenced by a URI: "More typically, you will pass a uri parameter within the Speech request's audio field, pointing to an audio file (in binary format, not base64) located on Google Cloud Storage."
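For illustration, here is a minimal sketch of that documented usage; the bucket and file names are made up. A gs:// URI is accepted, while an s3:// URI is not:

from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

# Works: audio stored on Google Cloud Storage
audio = speech.RecognitionAudio(uri="gs://my-bucket/audio.raw")
response = client.recognize(config=config, audio=audio)

# Not accepted: an S3 location cannot be referenced this way
# audio = speech.RecognitionAudio(uri="s3://my-bucket/audio.raw")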

I suggest you move your files to Google Cloud Storage (a rough transfer sketch follows this paragraph). If you don't want to, there is a good workaround: use the Google Cloud Speech API with its streaming API instead. Then you don't need to store anything anywhere; your voice application feeds it input from any microphone. And if you don't know how to handle microphone input, don't worry.
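A rough sketch of that transfer, assuming boto3 and google-cloud-storage are installed and both accounts are already configured; all bucket and object names are placeholders:

import boto3
from google.cloud import storage

# Download the audio file from S3 to a local temp path
s3 = boto3.client("s3")
s3.download_file("my-s3-bucket", "audio/sample.wav", "/tmp/sample.wav")

# Upload it to Google Cloud Storage
gcs = storage.Client()
gcs.bucket("my-gcs-bucket").blob("audio/sample.wav").upload_from_filename(
    "/tmp/sample.wav"
)

# The file is now addressable as gs://my-gcs-bucket/audio/sample.wav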

For the streaming approach, Google provides sample code, and that's all you need:

# [START speech_transcribe_streaming_mic]
from __future__ import division

import re
import sys

from google.cloud import speech

import pyaudio
from six.moves import queue

# Audio recording parameters
RATE = 16000
CHUNK = int(RATE / 10)  # 100ms


class MicrophoneStream(object):
    """Opens a recording stream as a generator yielding the audio chunks."""

    def __init__(self, rate, chunk):
        self._rate = rate
        self._chunk = chunk

        # Create a thread-safe buffer of audio data
        self._buff = queue.Queue()
        self.closed = True

    def __enter__(self):
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self._rate,
            input=True,
            frames_per_buffer=self._chunk,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread makes network requests, etc.
            stream_callback=self._fill_buffer,
        )

        self.closed = False

        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so that the client's
        # streaming_recognize method will not block the process termination.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
        """Continuously collect data from the audio stream, into the buffer."""
        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        while not self.closed:
            # Use a blocking get() to ensure there's at least one chunk of
            # data, and stop iteration if the chunk is None, indicating the
            # end of the audio stream.
            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]

            # Now consume whatever other data's still buffered.
            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break

            yield b"".join(data)


def listen_print_loop(responses):
    """Iterates through server responses and prints them.
    The responses passed is a generator that will block until a response
    is provided by the server.
    Each response may contain multiple results, and each result may contain
    multiple alternatives; for details, see the documentation.  Here we
    print only the transcription for the top alternative of the top result.
    In this case, responses are provided for interim results as well. If the
    response is an interim one, print a line feed at the end of it, to allow
    the next result to overwrite it, until the response is a final one. For the
    final one, print a newline to preserve the finalized transcription.
    """
    num_chars_printed = 0
    for response in responses:
        if not response.results:
            continue

        # The `results` list is consecutive. For streaming, we only care about
        # the first result being considered, since once it's `is_final`, it
        # moves on to considering the next utterance.
        result = response.results[0]
        if not result.alternatives:
            continue

        # Display the transcription of the top alternative.
        transcript = result.alternatives[0].transcript

        # Display interim results, but with a carriage return at the end of the
        # line, so subsequent lines will overwrite them.
        #
        # If the previous result was longer than this one, we need to print
        # some extra spaces to overwrite the previous result
        overwrite_chars = " " * (num_chars_printed - len(transcript))

        if not result.is_final:
            sys.stdout.write(transcript + overwrite_chars + "\r")
            sys.stdout.flush()

            num_chars_printed = len(transcript)

        else:
            print(transcript + overwrite_chars)

            # Exit recognition if any of the transcribed phrases could be
            # one of our keywords.
            if re.search(r"\b(exit|quit)\b", transcript, re.I):
                print("Exiting..")
                break

            num_chars_printed = 0


def main():
    language_code = "en-US"  # a BCP-47 language tag

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code=language_code,
    )

    streaming_config = speech.StreamingRecognitionConfig(
        config=config, interim_results=True
    )

    with MicrophoneStream(RATE, CHUNK) as stream:
        audio_generator = stream.generator()
        requests = (
            speech.StreamingRecognizeRequest(audio_content=content)
            for content in audio_generator
        )

        responses = client.streaming_recognize(streaming_config, requests)

        # Now, put the transcription responses to use.
        listen_print_loop(responses)


if __name__ == "__main__":
    main()
# [END speech_transcribe_streaming_mic]

The dependencies are google-cloud-speech, pyaudio, and six (imported by the sample).
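If you don't have them yet, a typical installation would look like this (PyAudio may additionally require the PortAudio system library):

pip install google-cloud-speech pyaudio six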

As for AWS S3, you can still keep your files there before and/or after getting the transcript from the Google Speech API, as sketched below. Streaming is also very fast.
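For example, a small sketch of writing a finished transcript back to S3 with boto3; the bucket, key, and transcript value are placeholders:

import boto3

transcript = "hello world"  # e.g. result.alternatives[0].transcript

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-s3-bucket",
    Key="transcripts/sample.txt",
    Body=transcript.encode("utf-8"),
)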

And don't forget your credentials: you need to authorize first by providing GOOGLE_APPLICATION_CREDENTIALS.
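On a typical Linux/macOS shell that looks like the following; the key file path is a placeholder:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"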

Regarding "amazon-s3 - How to feed audio files from an S3 bucket directly to Google Speech-to-Text", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/65422902/
