The News
OpenAI recently released its Realtime API. This lets you stream responses directly as audio, with no need for a separate text-to-speech step.
Twilio released a short tutorial for JavaScript and Python, showing how to use websockets to bi-directionally stream the conversation over the phone.
The Task
I’m going to show you my method for achieving this using Ruby. You can view the source code on GitHub or watch a video version on YouTube.
The Issue
There’s really not much documentation on asynchronous programming or LLM programming in Ruby compared to JavaScript and Python.
The Solution
Samuel Williams has done some great work providing tools for asynchronous Ruby. Watch his talks on YouTube; I recommend them. We will be using his Async gem and the Falcon web server (which runs Rack apps asynchronously, with support for HTTP/2).
The Why?
Personal preference. I’m a rubyist, and would prefer to code with what I’m comfortable with.
The Challenge
To fit this all in one blog post. You may just want to view the source on GitHub or watch a video.
The Steps
- Register a Twilio number
- Register an OpenAI account
- Set up an NGrok tunnel
- Create a Rack app
- Route incoming calls
- Connect Twilio socket
- Connect OpenAI socket
- Handle Events (i.e. stream audio)
- ??? Profit
1. Register a Twilio number
This is the most complicated part imho. Twilio has always seemed so over the top. I’m sure it never used to be this way. Unfortunately, I can’t help you do this. Please see the Twilio Docs for help.
2. Register an OpenAI account
Register for a paid platform account to access the API; anyone with a standard paid account can access the new Realtime API.
3. Set up an NGrok tunnel
You will need an address for Twilio to route inbound calls to. You can set up your own web server if you want; I used NGrok to create a public-facing URL that routes to my localhost for free.
ngrok http localhost:9292
4. Create a Rack app
Now the fun starts! We want to keep this as lightweight as possible, so we’re going to create a barebones Rack app. First, install your gems.
bundle init
bundle add rack dotenv twilio-ruby falcon
bundle add async async-http async-websocket protocol-rack
Then all we need is a config.ru file for our Rack app.
require 'async/websocket/adapters/rack'

class Application
  def self.call(env)
    Async::WebSocket::Adapters::Rack.open(env, protocols: ['ws']) do |connection|
      while message = connection.read
        puts message.parse
      end
    end or [200, {'content-type' => 'text/html'}, ["Hello World"]]
  end
end

run Application
Notice we’re jumping straight in with the websockets. We’ve also added a hello world fallback. Right now, any route on the server will either try to connect a websocket or display “Hello World”.
Now run the falcon server to see if we have any errors.
falcon serve --bind http://localhost:9292 -c config.ru
Opening it in the browser will display the “Hello World” text. But we could also create a client.rb file to test the websocket!
require 'async'
require 'async/http/endpoint'
require 'async/websocket/client'

# Point this at your server (your NGrok URL, or localhost for a local test).
URL = ENV.fetch('URL', 'ws://localhost:9292')

Async do |task|
  # alpn_protocols is a workaround for NGrok's fake HTTP/2
  endpoint = Async::HTTP::Endpoint.parse(URL, alpn_protocols: Async::HTTP::Protocol::HTTP11.names)

  Async::WebSocket::Client.connect(endpoint) do |connection|
    input_task = task.async do
      while line = $stdin.gets
        connection.write(Protocol::WebSocket::TextMessage.generate({
          text: line.chomp
        }))
        connection.flush
      end
    end

    connection.write(Protocol::WebSocket::TextMessage.generate({
      status: "connected"
    }))

    while message = connection.read
      puts "[response] #{message.inspect}"
    end
  ensure
    input_task&.stop
  end
end
You should be able to run the client and send messages to the server by typing and hitting return. These messages will be printed in the Falcon server’s console.
ruby client.rb
5. Route incoming calls
Let’s start answering some calls! Visit the Twilio console to set up the required webhook. Direct incoming calls to your host and choose a path; I’m choosing /incoming-call in this demo.
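If you’d rather script this than click around the console, the twilio-ruby REST client can set the webhook for you. A minimal sketch, assuming your account SID, auth token and the number’s SID are in ENV (those variable names are my own placeholders):

require 'twilio-ruby'

# Point the number's voice webhook at the NGrok tunnel.
client = Twilio::REST::Client.new(ENV['TWILIO_ACCOUNT_SID'], ENV['TWILIO_AUTH_TOKEN'])
client.incoming_phone_numbers(ENV['TWILIO_PHONE_NUMBER_SID'])
      .update(voice_url: "https://#{ENV['HOST']}/incoming-call", voice_method: 'POST')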
We need to create a separate route in the Rack app and respond with some TwiML that Twilio can understand. You can use the handy Twilio::TwiML module to generate all of the XML for you.
require 'twilio-ruby'

class IncomingCall
  def self.call(env)
    response = Twilio::TwiML::VoiceResponse.new
    response.say(message: "Connecting you to an agent.")

    [200, {"content-type" => "application/xml"}, [response.to_s]]
  end
end
class Application
  def self.call(env)
    request = Rack::Request.new(env)

    if request.path == '/incoming-call'
      IncomingCall.(env)
    else
      default(env)
    end
  end

  def self.default(env)
    Async::WebSocket::Adapters::Rack.open(env, protocols: ['ws']) do |connection|
      while message = connection.read
        puts message.parse
      end
    end or [200, {'content-type' => 'text/html'}, ["Hello World"]]
  end
end

run Application
We’ve created a separate route that should now respond with a voice message if you call the number. You can either use your real phone to test this or you can use the Twilio dev-phone. I like to use the dev-phone because I’m a cheapskate.
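If you haven’t tried it, the dev-phone is a Twilio CLI plugin. Assuming you already have the Twilio CLI installed and logged in, something like this should get it running:

twilio plugins:install @twilio-labs/plugin-dev-phone
twilio dev-phone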
6. Connect Twilio socket
We need to modify the /incoming-call path to tell Twilio to open a websocket to our server so we can pass the audio on to OpenAI. I’m setting up another route called /ai-stream to handle this connection.
class IncomingCall
  def self.call(env)
    response = Twilio::TwiML::VoiceResponse.new
    response.say(message: "Connecting you to an agent.")

    url = "wss://#{ENV["HOST"]}/ai-stream"
    connect = Twilio::TwiML::Connect.new.stream(url: url)
    response.append(connect)

    [200, {"content-type" => "application/xml"}, [response.to_s]]
  end
end
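For reference, the TwiML this generates should look roughly like the following (with your own host in the Stream url):

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say>Connecting you to an agent.</Say>
  <Connect>
    <Stream url="wss://your-subdomain.ngrok-free.app/ai-stream"/>
  </Connect>
</Response>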
Here’s the new websocket route. It will simply print the incoming data to the server console log.
class AiStream
  def self.call(env)
    Async::WebSocket::Adapters::Rack.open(env, protocols: ['ws'], handler: Socket::Twilio) do |twilio|
      while message = twilio.read
        puts message.parse
      end

      puts 'DISCONNECTED'
    end
  end
end
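Call the number and you should see Twilio’s JSON events scroll past in the console. The media events carry the caller’s audio and look roughly like this (payload is base64-encoded G.711 μ-law, trimmed here; the values are placeholders):

{
  "event": "media",
  "sequenceNumber": "42",
  "streamSid": "MZ...",
  "media": {
    "track": "inbound",
    "chunk": "41",
    "timestamp": "823",
    "payload": "<base64 audio>"
  }
}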
I’ve created a subclass of Async::WebSocket::Connection and passed it to the websocket adapter so that I can dump business logic into it.
class Socket::Twilio < Async::WebSocket::Connection
  def initialize(*, **)
    super
  end
end
7. Connect OpenAI socket
This is similar to the previous step except we will need to create a new Async routine to stream at the same time.
You will need your OpenAI API key for this step. I’ve got this saved in a .env
file. Using the dotenv
gem.
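My .env just holds the API key and the public host (both values below are placeholders). Remember that config.ru needs require 'dotenv/load' (or an explicit Dotenv.load) for these to be picked up.

OPENAI_API_KEY=sk-your-key-here
HOST=your-subdomain.ngrok-free.app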
URL = 'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01'
HEADERS = [
  ["Authorization", "Bearer #{ENV['OPENAI_API_KEY']}"],
  ["OpenAI-Beta", "realtime=v1"]
]
class AiStream
  def self.call(env)
    Async::WebSocket::Adapters::Rack.open(env, protocols: ['ws'], handler: Socket::Twilio) do |twilio|
      openai = Async::WebSocket::Client.connect(openai_endpoint, headers: HEADERS, handler: Socket::OpenAI)

      openai_task = Async do
        while message = openai.read
          puts message.parse
        end
      end

      while message = twilio.read
        puts message.parse
      end

      puts 'DISCONNECTED'
    ensure
      openai&.close
      openai_task&.stop
    end
  end

  def self.openai_endpoint
    Async::HTTP::Endpoint.parse(URL, alpn_protocols: Async::HTTP::Protocol::HTTP11.names)
  end
end
class Socket::OpenAI < Async::WebSocket::Connection
  def initialize(*, **)
    super
  end
end
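One gotcha before bridging the audio: the Realtime API defaults to PCM16, while Twilio streams G.711 μ-law. Once the OpenAI socket is open you’ll want to send a session.update event to switch formats. A rough sketch (drop it in after the connect call; the voice and instructions are just examples):

# Tell OpenAI to listen and speak in G.711 u-law, the codec Twilio Media Streams uses.
session_update = {
  type: 'session.update',
  session: {
    input_audio_format: 'g711_ulaw',
    output_audio_format: 'g711_ulaw',
    voice: 'alloy',
    instructions: 'You are a friendly phone agent.'
  }
}

openai.write(Protocol::WebSocket::TextMessage.generate(session_update))
openai.flush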
8. Handle Events
Now that we’re streaming from both sources, we need to actually do something with that data and connect the two. Essentially we just need to react to the events coming down the connections. We should parse the data, create a response and send it down the other socket.
There’s a lot of business logic coming, so I’ve created a separate class called Bridge, which handles all of the events and passes the data to the socket classes we created.
Lists of the available events can be found in the Twilio Media Streams docs and the OpenAI Realtime API reference.
require 'base64'

class Bridge
  TWILIO_EVENTS = [
    'start',
    'media',
    'mark'
  ]

  OPENAI_EVENTS = [
    'response.audio.delta',
    'input_audio_buffer.speech_started',
  ]

  def initialize(twilio, openai)
    @twilio = twilio
    @openai = openai
  end

  def handle_twilio(message)
    begin
      event = message[:event]

      if TWILIO_EVENTS.include? event
        send('twilio_' + event.to_s, message)
      end
    # rescue
    #   handle error here
    end
  end

  def handle_openai(message)
    begin
      event = message[:type]

      if OPENAI_EVENTS.include? event
        send('openai_' + event.gsub('.', '_'), message)
      end
    # rescue
    #   handle error here
    end
  end

  private

  # -------------
  # Twilio EVENTS
  # -------------

  def twilio_start(message)
  end

  def twilio_media(message)
  end

  def twilio_mark(message)
  end

  # -------------
  # OpenAI EVENTS
  # -------------

  def openai_response_audio_delta(message)
  end

  def openai_input_audio_buffer_speech_started(message)
  end
end
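To give a flavour of what goes in those empty methods, here’s a rough sketch of the three handlers that do the actual audio bridging. It’s not the repo’s exact code, and it assumes the μ-law session.update from the previous step so the base64 payloads can be passed straight through:

def twilio_start(message)
  # Remember the stream SID so we can address audio back to this call.
  @stream_sid = message[:start][:streamSid]
end

def twilio_media(message)
  # Caller audio arrives base64-encoded; append it to OpenAI's input buffer.
  @openai.write(Protocol::WebSocket::TextMessage.generate({
    type: 'input_audio_buffer.append',
    audio: message[:media][:payload]
  }))
  @openai.flush
end

def openai_response_audio_delta(message)
  # Model audio comes back as base64 deltas; wrap each one in a Twilio media event.
  @twilio.write(Protocol::WebSocket::TextMessage.generate({
    event: 'media',
    streamSid: @stream_sid,
    media: { payload: message[:delta] }
  }))
  @twilio.flush
end

The input_audio_buffer.speech_started handler is where you’d interrupt playback when the caller starts talking (for example by sending Twilio a clear event); see the repo for that part.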
Here’s my /ai-stream route now. All we’re doing is opening the connections and passing the events to the bridge, which handles all of the logic.
class AiStream
  def self.call(env)
    Async::WebSocket::Adapters::Rack.open(env, protocols: ['ws'], handler: Socket::Twilio) do |twilio|
      openai = Async::WebSocket::Client.connect(openai_endpoint, headers: HEADERS, handler: Socket::OpenAI)
      bridge = Bridge.new(twilio, openai)

      openai_task = Async do
        while message = openai.read
          bridge.handle_openai(message.parse)
        end
      end

      while message = twilio.read
        bridge.handle_twilio(message.parse)
      end

      puts 'DISCONNECTED'
    ensure
      openai&.close
      openai_task&.stop
    end
  end

  def self.openai_endpoint
    Async::HTTP::Endpoint.parse(URL, alpn_protocols: Async::HTTP::Protocol::HTTP11.names)
  end
end
9. ??? Profit
I recommend you view the GitHub repo to see the business logic. It’s very boilerplate, and the main takeaway from this post should be how to use the async gem to enable bi-directional streaming.
Next Step
This opens up a lot of possibilities. You will definitely want to implement auth of some kind. Perhaps not Okta. You could create different endpoints to make outbound calls, or use TwiML to build a menu for selecting the preferred AI prompt. The choice is yours.
Conclusion
Rubyists need not worry! Python can stay in its pen for now. The async gem really is great, and I can see myself getting a lot of use out of it.
To anyone reading this, please reach out with any comments.