Ruby Async Websockets with Twilio & OpenAI Realtime API

The News

OpenAI recently released its Realtime API. This lets you stream responses directly as audio, with no need for a separate text-to-speech step.

Twilio released a short tutorial for JavaScript and Python showing how to use websockets to bi-directionally stream the conversation over the phone.

The Task

I’m going to show you my method for achieving this using Ruby. You can view the source code on GitHub or watch a video version on YouTube.

The Issue

There’s really not much documentation on asynchronous programming or LLM programming in Ruby compared to JavaScript and Python.

The Solution

Samuel Williams has done some great work providing tools for asynchronous Ruby. I recommend watching his talks on YouTube. We will be using his Async gem and the Falcon webserver (which runs Rack apps asynchronously, with support for HTTP/2).

The Why?

Personal preference. I’m a rubyist, and would prefer to code with what I’m comfortable with.

The Challenge

To fit this all in one blog post. You may just want to view the source on Github or watch a video.

The Steps

  1. Register a Twilio number
  2. Register an OpenAI account
  3. Set up an NGrok tunnel
  4. Create a Rack app
  5. Route incoming calls
  6. Connect Twilio socket
  7. Connect OpenAI socket
  8. Handle Events (i.e. stream audio)
  9. ??? Profit

1. Register a Twilio number

This is the most complicated part imho. Twilio has always seemed so over the top. I’m sure it never used to be this way. Unfortunately, I can’t help you do this. Please see the Twilio Docs for help.

2. Register an OpenAI account

Register for a paid platform account to access the API. Anyone with a standard platform account can access the new Realtime API.

3. Set up an NGrok tunnel

You will need an address for Twilio to route inbound calls to. You can set up your own webserver if you want; I used NGrok to create a public-facing URL that routes to my localhost for free.

ngrok http localhost:9292

4. Create a Rack app

Now the fun starts! We want to keep this as lightweight as possible, so we’re going to create a barebones Rack app. First, install your gems.

bundle init
bundle add rack dotenv twilio-ruby falcon
bundle add async async-http async-websocket protocol-rack

Then all we need is a config.ru file for our Rack app.

require 'async/websocket/adapters/rack'

class Application
  def self.call(env)
    # Upgrade websocket requests; fall back to a plain HTTP response otherwise.
    Async::WebSocket::Adapters::Rack.open(env, protocols: ['ws']) do |connection|
      while message = connection.read
        puts message.parse
      end
    end or [200, {'content-type' => 'text/html'}, ["Hello World"]]
  end
end

run Application

Notice we’re jumping straight in with the websockets. We’ve also added a hello world fallback. Right now, any route on the server will either try to connect a websocket or display “Hello World”.

Now run the falcon server to see if we have any errors.

falcon serve --bind http://localhost:9292 -c config.ru

Opening it in the browser will display the “Hello World” text. But we could also create a client.rb to test the websocket!

require 'async'
require 'async/http/endpoint'
require 'async/websocket/client'

# Point this at your NGrok address, or straight at Falcon on localhost.
URL = 'ws://localhost:9292'

Async do |task|
  # alpn_protocols is a workaround for NGrok's fake HTTP/2
  endpoint = Async::HTTP::Endpoint.parse(URL, alpn_protocols: Async::HTTP::Protocol::HTTP11.names)

  Async::WebSocket::Client.connect(endpoint) do |connection|
    input_task = task.async do
      while line = $stdin.gets
        connection.write(Protocol::WebSocket::TextMessage.generate({
          text: line.chomp
        }))
        connection.flush
      end
    end

    connection.write(Protocol::WebSocket::TextMessage.generate({
      status: "connected"
    }))

    while message = connection.read
      puts "[response] #{message.inspect}"
    end
  ensure
    input_task&.stop
  end
end

You should be able to run the client and send messages to the server by typing and hitting return. With the config.ru above, each message is simply printed to the server console.

ruby client.rb
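If you want the client’s [response] loop to actually print something, a minimal tweak (my addition, not part of the repo) is to write each message straight back on the connection:

class Application
  def self.call(env)
    Async::WebSocket::Adapters::Rack.open(env, protocols: ['ws']) do |connection|
      while message = connection.read
        puts message.parse
        # Send the same message object back so the client prints a [response] line.
        connection.write(message)
        connection.flush
      end
    end or [200, {'content-type' => 'text/html'}, ["Hello World"]]
  end
end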

5. Route incoming calls

Let’s start answering some calls! Visit the Twilio console to set up the required webhook. Direct incoming calls to your host and choose a path. I’m choosing /incoming-call in this demo.

We need to create a separate route in the Rack app and respond with some TwiML that Twilio can understand. You can use the handy Twilio::TwiML module to generate all of the XML for you.

require 'twilio-ruby'

class IncomingCall
  def self.call(env)
    response = Twilio::TwiML::VoiceResponse.new
    response.say(message: "Connecting you to an agent.")

    [200, {"content-type" => "application/xml"}, [response.to_s]]
  end
end

class Application
  def self.call(env)
    request = Rack::Request.new(env)
    if request.path == '/incoming-call'
      IncomingCall.(env)
    else
      default(env)
    end
  end

  def self.default(env)
    Async::WebSocket::Adapters::Rack.open(env, protocols: ['ws']) do |connection|
      while message = connection.read
        puts message.parse
      end
    end or [200, {'content-type' => 'text/html'}, ["Hello World"]] 
  end
end

run Application

We’ve created a separate route that should now respond with a voice message if you call the number. You can either use your real phone to test this or you can use the twilio dev-phone. I like to use the dev-phone because I’m a cheapskate.
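If you’ve got the Twilio CLI installed, the dev-phone ships as a plugin. If memory serves, installing and launching it looks something like this:

twilio plugins:install @twilio-labs/plugin-dev-phone
twilio dev-phone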

6. Connect Twilio socket

We need to modify the /incoming-call path to tell Twilio to open a websocket to our server so we can pass the audio on to OpenAI. I’m setting up another route called /ai-stream to handle this connection.

class IncomingCall
  def self.call(env)
    response = Twilio::TwiML::VoiceResponse.new
    response.say(message: "Connecting you to an agent.")

    url = "wss://#{ENV["HOST"]}/ai-stream"
    connect = Twilio::TwiML::Connect.new.stream(url: url)
    response.append(connect)

    [200, {"content-type" => "application/xml"}, [response.to_s]]
  end
end

Here’s the new websocket route. It will simply print the incoming data to the server console log.

class AiStream
  def self.call(env)
    Async::WebSocket::Adapters::Rack.open(env, protocols: ['ws'], handler: Socket::Twilio) do |twilio|

      while message = twilio.read
        puts message.parse
      end

      puts 'DISCONNECTED'
    end
  end
end
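At this point the route just prints whatever Twilio sends. Once a call connects, the parsed messages look roughly like this (a trimmed sketch of the standard Media Streams events, not output captured from this app):

# Roughly what the parsed Twilio Media Streams messages look like (keys symbolized):
# { event: "connected", protocol: "Call", version: "1.0.0" }
# { event: "start", streamSid: "MZxxxx",
#   start: { callSid: "CAxxxx", mediaFormat: { encoding: "audio/x-mulaw", sampleRate: 8000, channels: 1 } } }
# { event: "media", streamSid: "MZxxxx",
#   media: { track: "inbound", chunk: "2", timestamp: "5", payload: "<base64 mu-law audio>" } }
# { event: "stop", streamSid: "MZxxxx" }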

I’ve created a subclass of Async::WebSocket::Connection and passed it to the websocket adapter so that I can dump business logic into it.

class Socket::Twilio < Async::WebSocket::Connection
  def initialize(*, **)
    super
  end
end

7. Connect OpenAI socket

This is similar to the previous step, except we will need to create a new Async task so we can stream from both connections at the same time. You will need your OpenAI API key for this step; I’ve got mine saved in a .env file, loaded with the dotenv gem.
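The simplest way I know of to load it is a one-liner at the top of config.ru (assuming a .env containing OPENAI_API_KEY, plus the HOST used in the TwiML above):

# Top of config.ru — pulls OPENAI_API_KEY (and HOST) into ENV before anything reads them.
require 'dotenv/load'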

URL = 'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01'
HEADERS = [
  ["Authorization", "Bearer #{ENV['OPENAI_API_KEY']}"],
  ["OpenAI-Beta", "realtime=v1"]
]

class AiStream
  def self.call(env)
    Async::WebSocket::Adapters::Rack.open(env, protocols: ['ws'], handler: Socket::Twilio) do |twilio|

      openai = Async::WebSocket::Client.connect(openai_endpoint, headers: HEADERS, handler: Socket::OpenAI)

      openai_task = Async do
        while message = openai.read
          puts message.parse
        end
      end

      while message = twilio.read
        puts message.parse
      end

      puts 'DISCONNECTED'

    ensure
      openai&.close
      openai_task&.stop
    end
  end

  def self.openai_endpoint
    Async::HTTP::Endpoint.parse(URL, alpn_protocols: Async::HTTP::Protocol::HTTP11.names)
  end
end

class Socket::OpenAI < Async::WebSocket::Connection
  def initialize(*, **)
    super
  end
end
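One detail worth knowing before the next step: Twilio streams G.711 μ-law audio at 8kHz, while the Realtime API defaults to PCM16, so the session needs to be told to accept and produce μ-law. Here’s a sketch of the kind of session.update event you’d send once the OpenAI socket is open (the voice and instructions values are just placeholders):

# Sent once after connecting to OpenAI, before any audio flows.
openai.write(Protocol::WebSocket::TextMessage.generate({
  type: "session.update",
  session: {
    input_audio_format: "g711_ulaw",   # matches Twilio's inbound audio
    output_audio_format: "g711_ulaw",  # so Twilio can play the response directly
    voice: "alloy",
    instructions: "You are a helpful phone agent.",
    turn_detection: { type: "server_vad" }
  }
}))
openai.flush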

8. Handle Events

Now that we’re streaming from both sources, we need to actually do something with that data and connect the two. Essentially we just need to react to the events coming down the connections. We should parse the data, create a response and send it down the other socket.

There’s a lot of business logic coming, so I’ve created a separate class called Bridge, which handles all of the events and passes the data to the socket classes we created.

A list of the available events can be found in the Twilio Media Streams docs and the OpenAI Realtime API reference.

require 'base64'

class Bridge
  TWILIO_EVENTS = [
    'start',
    'media',
    'mark'
  ]

  OPENAI_EVENTS = [
    'response.audio.delta',
    'input_audio_buffer.speech_started',
  ]

  def initialize(twilio, openai)
    @twilio = twilio
    @openai = openai
  end

  def handle_twilio(message)
    begin
      event = message[:event]
      if TWILIO_EVENTS.include? event
        send('twilio_' + event.to_s, message)
      end
    # rescue
      # handle error here
    end
  end

  def handle_openai(message)
    begin
      event = message[:type]
      if OPENAI_EVENTS.include? event
        send('openai_' + event.gsub('.', '_'), message)
      end
    # rescue
      # handle error here
    end
  end

  private

  # -------------
  # Twilio EVENTS
  # -------------
  def twilio_start(message)
  end
  def twilio_media(message)
  end
  def twilio_mark(message)
  end

  # -------------
  # OpenAI EVENTS
  # -------------
  def openai_response_audio_delta(message)
  end
  def openai_input_audio_buffer_speech_started(message)
  end
end
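The repo fills these handlers in properly. As a rough sketch of the two that matter most (assuming the standard Twilio Media Streams and OpenAI Realtime payload shapes; the @stream_sid capture is my own addition), twilio_media pushes the caller’s audio into OpenAI’s input buffer and openai_response_audio_delta plays the reply back down the phone line:

# Inside Bridge — a sketch only; the real implementations live in the repo.

def twilio_start(message)
  # Remember the stream SID so we can address media back to this call.
  @stream_sid = message[:streamSid]
end

def twilio_media(message)
  # Forward the caller's base64 G.711 audio into OpenAI's input buffer.
  @openai.write(Protocol::WebSocket::TextMessage.generate({
    type: "input_audio_buffer.append",
    audio: message[:media][:payload]
  }))
  @openai.flush
end

def openai_response_audio_delta(message)
  # Play OpenAI's audio back down the phone line.
  @twilio.write(Protocol::WebSocket::TextMessage.generate({
    event: "media",
    streamSid: @stream_sid,
    media: { payload: message[:delta] }
  }))
  @twilio.flush
end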

Here’s my /ai-stream route now. All we’re doing is opening the connections and passing the events to the bridge, which handles all of the logic.

class AiStream
  def self.call(env)
    Async::WebSocket::Adapters::Rack.open(env, protocols: ['ws'], handler: Socket::Twilio) do |twilio|

      openai = Async::WebSocket::Client.connect(openai_endpoint, headers: HEADERS, handler: Socket::OpenAI)
      bridge = Bridge.new twilio, openai

      openai_task = Async do
        while message = openai.read
          bridge.handle_openai(message.parse)
        end
      end

      while message = twilio.read
        bridge.handle_twilio(message.parse)
      end

      puts 'DISCONNECTED'

    ensure
      openai&.close
      openai_task&.stop
    end
  end

  def self.openai_endpoint
    Async::HTTP::Endpoint.parse(URL, alpn_protocols: Async::HTTP::Protocol::HTTP11.names)
  end
end

9. ??? Profit

I recommend you view the GitHub repo for the business logic. It’s fairly boilerplate, and the main takeaway from this post should be how to use the async gem to enable bi-directional streaming.

Next Step

This opens up a lot of possibilities. You will definitely want to implement auth of some kind. Perhaps not Okta. You could create different endpoints to make outbound calls, or use TwiML to build a menu that selects the preferred AI prompt. The choice is yours.
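As a starting point for outbound calls, the twilio-ruby REST client can place a call that points back at the same /incoming-call webhook. A minimal sketch, assuming TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN and TWILIO_NUMBER are in your .env:

require 'twilio-ruby'

client = Twilio::REST::Client.new(ENV['TWILIO_ACCOUNT_SID'], ENV['TWILIO_AUTH_TOKEN'])

# Twilio fetches TwiML from this URL when the callee answers,
# so the call flows into the same websocket bridge as an inbound call.
client.calls.create(
  to: '+15551234567',              # the number to dial (example)
  from: ENV['TWILIO_NUMBER'],
  url: "https://#{ENV['HOST']}/incoming-call"
)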

Conclusion

Rubyists need not worry! Python can stay in its pen for now. The async gem really is great, and I can see myself getting a lot of use out of it.

To anyone reading this, please reach out with any comments.