Building a tRUSTworthy web service 🦀
I wrote this blog post while interning at Holmusk, a digital health company. The product mentioned below is what I worked on with Rust during my time there.
I worked on FoodDX, a service that helps people get insights on how to improve their diet with the help of proprietary AI technology and personalized feedback from nutrition experts. In its current stage, it scores images of food taken through an app on a scale of 1-5 and provides personalized tips for the food.
It's a big project with a lot of components - the AI models, the app, and the backend infrastructure which handles it all. We'll be taking a look at the backend in this article.
Before we dive into the infrastructure, it would help to take a look at what an image goes through once it enters our system.
Haskell is used for the client-facing API and for other ad-hoc tasks, such as reading from and writing to a database.
Rust is used for image preprocessing, model inference, and sending the results back. Sending of results was done differently in the two approaches outlined below. Rust was chosen for its high efficiency, small executable footprint, and absence of a garbage collector. It also has a strong type system and a relatively actively maintained TensorFlow (client) library.
The internal organization of the Rust service in this architecture is outlined above. There were 3 main parts, each running concurrently on its own Tokio runtime: polling SQS, preprocessing and running inference on the images, and cleanup tasks (like writing results to Redis and notifying SQS that the image can now be taken off the queue).
The external processes related to this architecture are outlined below.
The main gripe we had was with the SQS upload event notification. In our benchmarks it was very slow, and we aim for the service to have very low latency: the goal is that every image that comes into the system is scored in under 1 second. Because of the way the system was designed, getting closer to that goal meant a pretty big makeover on the Rust side and some tweaks on the Haskell side. The notification delay is also mentioned in the AWS docs, which state that
Typically, event notifications are delivered in seconds but can sometimes take a minute or longer.
As mentioned above, the main reason for redesigning the architecture was to avoid the SQS upload event notification, as low latency is a high priority in this project. In the process, we also ended up simplifying the architecture by removing unnecessary moving parts.
Internally, the Rust service now also has a web server. The client-facing API (written in Haskell) proxies the HTTP requests it receives to the Rust server via a load balancer (AWS ELB). In this version of the architecture, we completely eliminate the use of a queue (SQS).
We have 3 Tokio runtimes running simultaneously and somewhat independently of one another. These runtimes communicate by passing messages over bounded channels. The "messages" we pass are custom structs we define for communication.
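To make the message-passing idea concrete, here is a minimal sketch of three stages connected by bounded channels. It is not our production code: it uses plain threads and std::sync::mpsc::sync_channel as stand-ins for the Tokio runtimes and async channels, and the names ImageJob, Scored, fake_score and run_pipeline are made up for illustration.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

/// Hypothetical message from the request-handling stage to the inference stage.
#[derive(Debug)]
struct ImageJob {
    id: u32,
    bytes: Vec<u8>,
}

/// Hypothetical message from the inference stage to the final stage.
#[derive(Debug, PartialEq)]
struct Scored {
    id: u32,
    score: u8,
}

/// Stand-in for model inference: maps the payload size onto a 1-5 score.
fn fake_score(bytes: &[u8]) -> u8 {
    (bytes.len() % 5) as u8 + 1
}

/// Runs three stages on separate threads and returns what the last stage received.
fn run_pipeline(jobs: Vec<ImageJob>) -> Vec<Scored> {
    // Bounded channels: `send` blocks once 4 messages are waiting, so a slow
    // stage applies backpressure to the stage feeding it.
    let (job_tx, job_rx): (SyncSender<ImageJob>, Receiver<ImageJob>) = sync_channel(4);
    let (scored_tx, scored_rx) = sync_channel::<Scored>(4);

    let producer = thread::spawn(move || {
        for job in jobs {
            job_tx.send(job).expect("inference stage hung up");
        }
        // `job_tx` is dropped here, which ends the inference loop below.
    });

    let inference = thread::spawn(move || {
        for job in job_rx {
            let scored = Scored { id: job.id, score: fake_score(&job.bytes) };
            scored_tx.send(scored).expect("final stage hung up");
        }
    });

    let collector = thread::spawn(move || scored_rx.into_iter().collect::<Vec<_>>());

    producer.join().unwrap();
    inference.join().unwrap();
    collector.join().unwrap()
}
```

The bound is the important design choice here: with an unbounded channel, a burst of requests would simply pile up in memory in front of the slowest stage.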
Finally, because each request handler needs the result for its own image, the handler first creates a oneshot channel for receiving its result and passes it along as metadata for the image. Once the image has gone through inference as part of a batch, the result is sent back to that image's request handler so it can be returned.
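The request/reply flow can be sketched roughly as follows. This is an illustration under assumptions, not the real service: a std::sync::mpsc channel that is used exactly once stands in for tokio::sync::oneshot, threads stand in for async request handlers, and Request, infer_batch, batching_loop, handle_request and score_all are hypothetical names.

```rust
use std::sync::mpsc::{channel, sync_channel, Receiver, Sender, SyncSender};
use std::thread;

/// Hypothetical message: the image plus the sender half of the handler's reply channel.
struct Request {
    image: Vec<u8>,
    reply_to: Sender<u8>,
}

/// Stand-in for model inference on a whole batch: one 1-5 score per image.
fn infer_batch(batch: &[Request]) -> Vec<u8> {
    batch.iter().map(|r| (r.image.len() % 5) as u8 + 1).collect()
}

/// Waits for one request, takes whatever else is already queued (up to `max_batch`),
/// scores the batch, and answers each request on its own reply channel.
fn batching_loop(requests: Receiver<Request>, max_batch: usize) {
    while let Ok(first) = requests.recv() {
        let mut batch = vec![first];
        while batch.len() < max_batch {
            match requests.try_recv() {
                Ok(req) => batch.push(req),
                Err(_) => break, // nothing else queued right now
            }
        }
        let scores = infer_batch(&batch);
        for (req, score) in batch.into_iter().zip(scores) {
            // The handler may have gone away in the meantime; ignore that case.
            let _ = req.reply_to.send(score);
        }
    }
}

/// What a handler does: create a reply channel, enqueue, wait for its own result.
fn handle_request(queue: &SyncSender<Request>, image: Vec<u8>) -> u8 {
    let (reply_to, reply) = channel();
    queue.send(Request { image, reply_to }).expect("batching loop stopped");
    reply.recv().expect("batching loop dropped the request")
}

/// Spawns the batching loop plus one handler thread per image.
/// Returns the scores in the same order as the input images.
fn score_all(images: Vec<Vec<u8>>) -> Vec<u8> {
    let (queue, requests) = sync_channel::<Request>(8);
    let batcher = thread::spawn(move || batching_loop(requests, 4));
    let handlers: Vec<_> = images
        .into_iter()
        .map(|image| {
            let queue = queue.clone();
            thread::spawn(move || handle_request(&queue, image))
        })
        .collect();
    drop(queue); // once every handler finishes, the batching loop sees a closed channel
    let scores = handlers.into_iter().map(|h| h.join().unwrap()).collect();
    batcher.join().unwrap();
    scores
}
```

Because every request carries its own reply sender, the batching loop never needs a lookup table to work out which result belongs to which handler.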
As mentioned, we have completely avoided the use of SQS in this architecture. The external architecture around the Rust service now looks like this:
We use a few different kinds of channels to communicate between different parts of the Rust service. Check out this chapter from the Rust book for some more context on how they work!
std::sync::mpsc: This is the only sync channel we use (the rest are async). We use it to tell the main function that the models have been loaded. Since the main function is sync, we use the built-in synchronous channel Rust provides.
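As a rough sketch of that "models are loaded" signal, assuming a hypothetical load_models function and a made-up wait_for_models helper (neither is from our codebase):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for loading the models; returns how many were loaded.
fn load_models() -> usize {
    thread::sleep(Duration::from_millis(10)); // pretend this is slow
    2
}

/// Starts loading on a background thread and blocks until it reports back
/// or the timeout expires.
fn wait_for_models(timeout: Duration) -> Result<usize, mpsc::RecvTimeoutError> {
    let (loaded_tx, loaded_rx) = mpsc::channel();
    thread::spawn(move || {
        let count = load_models();
        // Ignore the error: it only means the receiver stopped waiting.
        let _ = loaded_tx.send(count);
    });
    // Blocking wait: fine here because the caller is not inside an async task.
    loaded_rx.recv_timeout(timeout)
}
```

Blocking like this is acceptable only outside the async runtime; inside an async task, a blocking receive would stall the worker thread it runs on.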
The other channels are async, meaning they don't block the runtime while awaiting a result. Instead, they pass control back to the async runtime (tokio in this case) so that other tasks can run. The async channels are:
tokio::sync::oneshot: A oneshot is a channel with exactly one sender and one receiver that carries a single value. The handler keeps the Receiver and passes its Sender along through the program. Once processing is finished (requests are processed a batch at a time), the oneshot is used to send the result back to the handler of that specific request, maintaining the required one-to-one mapping between request and response.
async_channel::bounded: We use it like an MPSC (Multi Producer, Single Consumer) channel for communication between tasks, for example to pass data from the many request handlers to the batching task. We'd like to use tokio::sync::mpsc, but:
These are our takeaways for using Rust in this project!
rusoto library. Because this library depended on Tokio 0.1.15, we couldn't migrate to Tokio 1.x for a really long time. We were able to do it later when rusoto was updated, but we had expected such a critical library to stay up to date. Things are looking good, however, with AWS announcing that they are working on an official SDK for Rust.
We also have some general takeaways and gotchas we encountered in this project:
C bindings were not built/available for a large variety of the GPU instances we use in AWS. This proved a little tedious to fix, as we had to manually compile TensorFlow for the systems we use in production; without that, we experienced slow inference and model loading times.
The image hash is calculated and used to check for duplicates. ↩︎
Tokio is an asynchronous runtime for the Rust programming language. A lot of languages have a built-in async runtime; Rust lets you choose whichever runtime you require, and Tokio is the most popular option in the Rust ecosystem. Check out this resource for more insight into async and the Rust async ecosystem! ↩︎