Kontext Realtime
Create and edit images using your voice
Kontext Realtime is an open-source web app that lets you create and edit images using voice commands. The app uses Cloudflare Workers for its serverless backend, Replicate for AI model access (specifically the Flux Schnell and Flux Kontext Pro models), and OpenAI's Realtime API for voice processing. With this setup you can display a webcam feed, snap photos, remove elements, change backgrounds, and apply artistic styles, all controlled by spoken instructions.
Run it yourself at kontext-realtime.replicate.dev
Check out the source on GitHub at replicate/kontext-realtime
Demo
Demo Transcript
The words from the demo. Adding this for SEO juice, more than for human benefit.
Hey party people, this is Zeke from Replicate and today I want to show you how to build an app that lets you edit images using your voice. So, this is Kontext Realtime. It is an open-source app that runs on Cloudflare Workers, using Replicate, Black Forest Labs, and OpenAI's Realtime API. Let me give you a demo.
Can you hear me? Loud and clear. What's up? Great. Can you show me the webcam? Nice. Can you snap a photo now? Okay, that looks perfect. Um, can you edit the image and remove the pyramid from the background while preserving all other aspects of the image? Awesome. Okay. Can you change the background of the image to be kind of a green color? Nice. And can you add a bunch of uh, livestock in the background, like a herd of camels walking across the desert? Perfect. Okay, can you turn this into a cartoon representation of the same image? Undo. Can you turn this into a 1950s comic book rendition of the same image? (Laughs) I like that more. Okay. Um, can you, um, start over? Let's start over. Let's generate an image of a raccoon, um, wearing a hoodie. Okay, let's turn it into a linocut style artwork. Can you add a text label at the bottom of the image that says "Hoodie"? (Laughs) Okay. That looks pretty good.
I'm going to hit pause here. So, uh, OpenAI doesn't interrupt us anymore, but, um, yeah, so this is an open-source app. It is up on GitHub. Um, super fun to play with. Oh, that's the wrong URL. I better fix that. Let's just jump in here and find it. Okay, so we'll go to repositories. Here it is. I'll drop a link in the, the Twitters or the X's or the YouTubes or wherever this video ends up.
Um, but the gist is, it's basically a, a pretty simple web app that just is a, uh, Cloudflare Workers app. So, um, the nice thing there is you can have kind of a serverless component to your app where, uh, you can have something that runs in the cloud. So, there's a little bit of code to make that work. That includes code to connect to OpenAI's Realtime API so that you can get, uh, streaming tokens, to be able, uh, ephemeral tokens to be able to use, um, in your browser. Um, and then there's a little bit of code here that does the image generation, using the Flux model. Oops. And the image editing, using the Flux Kontext model from Black Forest Labs.
So, very little code on the server side. The majority of the code lives in the public directory. It's basically just a regular old HTML webpage that has CDN hosted React stuff in it. They say you're not supposed to do this in production, but for a demo, it's perfectly adequate. It's a performance thing because it's actually doing like JSX transpilation in your browser, which is a little bit costly, but computers are so fast these days, so who cares.
Um, the bulk of the logic for the application is inside this React thing here, and the most interesting part is this series of functions. So, basically what you're doing here is defining a big JavaScript object that has a description, and parameters for all of these functions, and then the actual definition of the function to invoke. So, you hand this giant object to OpenAI and you say, "Hey, let's start chatting," and during the course of the chat, if you hear something from me that sounds like I'm asking for you to run one of these functions, then go ahead and do it. So, that's how that works.
So, I encourage you to clone this project, uh, take it for a spin, add your own, uh, new functions to it and have fun. Thanks for watching. Bye.
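The "big JavaScript object" pattern from the transcript, where each function carries a description, its parameters, and the code to invoke, might look roughly like the sketch below. This is not the project's actual code: the names `VoiceTool`, `tools`, `toToolDefinitions`, and `dispatchToolCall` are hypothetical, the tool names and parameter names are invented for illustration, and the handlers are synchronous stand-ins that return a string instead of calling the Worker's image endpoints. The flat `{ type: "function", name, description, parameters }` shape is an assumption about how tool definitions are passed to a Realtime session, so check it against the current OpenAI documentation.

```typescript
// Hypothetical sketch of a voice-tool registry; not taken from the kontext-realtime source.

// Minimal JSON-Schema-like description of a tool's arguments.
type ToolParameters = {
  type: "object";
  properties: Record<string, { type: string; description: string }>;
  required: string[];
};

interface VoiceTool {
  description: string;        // tells the model when this tool applies
  parameters: ToolParameters; // tells the model which arguments to supply
  run: (args: Record<string, string>) => string; // the function to invoke
}

// One object holding every tool, keyed by the name the model will call.
const tools: Record<string, VoiceTool> = {
  generate_image: {
    description: "Generate a new image from a text prompt",
    parameters: {
      type: "object",
      properties: { prompt: { type: "string", description: "What to draw" } },
      required: ["prompt"],
    },
    // Stand-in: a real handler would request an image from the backend.
    run: (args) => `generate: ${args.prompt}`,
  },
  edit_image: {
    description: "Edit the current image according to an instruction",
    parameters: {
      type: "object",
      properties: {
        instruction: { type: "string", description: "The change to make" },
      },
      required: ["instruction"],
    },
    // Stand-in: a real handler would send the current image plus the instruction.
    run: (args) => `edit: ${args.instruction}`,
  },
};

// Drop the handlers to get the definitions that are handed to the model.
function toToolDefinitions(registry: Record<string, VoiceTool>) {
  return Object.entries(registry).map(([name, tool]) => ({
    type: "function" as const,
    name,
    description: tool.description,
    parameters: tool.parameters,
  }));
}

// When the model emits a function call, look up the tool and run it.
function dispatchToolCall(
  registry: Record<string, VoiceTool>,
  name: string,
  argsJson: string,
): string {
  const tool = registry[name];
  if (!tool) return `Unknown tool: ${name}`;
  const args = JSON.parse(argsJson) as Record<string, string>;
  for (const key of tool.parameters.required) {
    if (!(key in args)) return `Missing argument: ${key}`;
  }
  return tool.run(args);
}
```

Keeping the description, parameters, and handler side by side means adding a new voice command is a single new entry in `tools`; the definitions sent to the model and the dispatch table cannot drift apart.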