By Sampras Lopes, Nishant Joshi & Sanchith Hegde

Everyone knows that Rust is the language of the gods. Here’s how we still managed to shoot ourselves in the foot. We were gearing up for a release and had just deployed to our internal test environment. We were not particularly seeking any adventure that evening, but our application promptly crashed. A quick look at the logs told us that a thread had hit its stack limit.

On redeploying with backtraces enabled (`RUST_BACKTRACE=1`), we learnt that Rust does not print a backtrace for stack overflows. Completely clueless, we rolled back to the latest stable version.

The code was working fine locally, so we tried reproducing the issue in another test environment. The failures were intermittent and random, adding to our woes. We decided to get our hands dirty and run the code under `rust-lldb` with a restricted stack size. After multiple runs, we saw that some of the frames used many KBs of stack.

We tried bisecting through the commits and, after a long, agonizing wait, were able to locate the commit causing the issue.

Dormammu, I've Come to Bargain

We reached the offending function and as it turned out, the bug was literally one word long. See if you can spot it.

Consider the example below:

We have two structs here. `PsqlWrapper` is a struct that allows us to connect to the Postgres database. In addition, we implemented a `KafkaWrapper` around the `PsqlWrapper`. Inside the `KafkaWrapper`, we generated a Kafka event where needed, and intended to call the underlying function from the `PsqlWrapper`. So, we just used `PsqlWrapper` as a field within `KafkaWrapper`.
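Here is a minimal sketch of that setup, not our production code. Only `PsqlWrapper`, `KafkaWrapper` and `do_something_second` are names from this post; the `Storage` trait, `do_something_first`, the `BoxFuture` alias and the `u32` return values are made up for illustration. To keep the sketch free of dependencies, the trait methods are written with the kind of boxed-future signature that `#[async_trait]` generates from an `async fn`, and the Kafka event is reduced to a comment.

```rust
use std::future::Future;
use std::pin::Pin;

// Hypothetical alias for the boxed future returned by each trait method.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

// Hypothetical storage interface shared by both wrappers.
trait Storage {
    fn do_something_first(&self) -> BoxFuture<'_, u32>;
    fn do_something_second(&self) -> BoxFuture<'_, u32>;
}

// Stand-in for the wrapper that talks to Postgres.
struct PsqlWrapper;

impl Storage for PsqlWrapper {
    fn do_something_first(&self) -> BoxFuture<'_, u32> {
        Box::pin(async { 1_u32 })
    }

    fn do_something_second(&self) -> BoxFuture<'_, u32> {
        Box::pin(async { 2_u32 })
    }
}

// Wraps the Postgres layer and is meant to add Kafka events on top of it.
struct KafkaWrapper {
    psql: PsqlWrapper,
}

impl Storage for KafkaWrapper {
    fn do_something_first(&self) -> BoxFuture<'_, u32> {
        Box::pin(async move {
            // (A Kafka event would be emitted here.)
            self.psql.do_something_first().await
        })
    }

    fn do_something_second(&self) -> BoxFuture<'_, u32> {
        Box::pin(async move {
            // No Kafka event for this one; it only forwards the call.
            self.do_something_second().await
        })
    }
}
```

Both methods compile cleanly. Futures are lazy, so building the future returned by `do_something_second` on a `KafkaWrapper` is harmless; awaiting it is a different story.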

But instead of calling the function from the `PsqlWrapper`, we ended up calling the one from the `KafkaWrapper`. So, the fix here was simple.
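As a stripped-down sketch, with the same caveats (`Storage`, `BoxFuture` and the return value are hypothetical, and the boxed signature stands in for what `#[async_trait]` generates), the change amounts to one extra field access:

```rust
use std::future::Future;
use std::pin::Pin;

// Hypothetical alias for the boxed future returned by the trait method.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

// Hypothetical storage interface, reduced to the one method that matters.
trait Storage {
    fn do_something_second(&self) -> BoxFuture<'_, u32>;
}

struct PsqlWrapper;

impl Storage for PsqlWrapper {
    fn do_something_second(&self) -> BoxFuture<'_, u32> {
        Box::pin(async { 2_u32 })
    }
}

struct KafkaWrapper {
    psql: PsqlWrapper,
}

impl Storage for KafkaWrapper {
    fn do_something_second(&self) -> BoxFuture<'_, u32> {
        Box::pin(async move {
            // Before: self.do_something_second().await  (KafkaWrapper calling itself)
            // After: delegate to the wrapped PsqlWrapper.
            self.psql.do_something_second().await
        })
    }
}
```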

We caused an unconditional recursion unbeknownst to the Rust compiler. What?

There was no reason to suspect the Kafka wrapper because the failing API relied on `do_something_second` which we knew didn’t add any logic at all. We also used the `PsqlWrapper` directly in the local setup. That’s why it worked fine locally.

Who’s to blame here?

Mostly us, but we wanted to understand how this evaded the ever-watchful Rust compiler. For unconditional recursion in an ordinary function, Rust gives us a helpful warning (the `unconditional_recursion` lint). But in this case neither Rust nor Clippy gave us any hint. Why is that the case? Is it because of `async`? Turns out, it isn’t. In the case of `async`, Rust is quite strict: an `async fn` that awaits itself directly is rejected at compile time. So how did Rust fail to catch this problem? It's how the async is implemented here that allowed the issue to sneak through.

`async_trait` is the second culprit under investigation. When we wrote this code, Rust on its own didn’t support async functions in traits (native support was stabilized later, in Rust 1.75), so someone came up with a clever way to achieve it (GG `dtolnay/async-trait`). The crate uses `dyn Future`. Fun fact: Rust itself recommends a boxed `dyn Future` if we want recursion in async code. However, Rust doesn’t raise the `unconditional_recursion` warning when we use it.
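A small sketch of both halves of that statement, using hypothetical functions `sum_to` and `forever` that have nothing to do with our codebase. The first is the recommended pattern: recursion through a boxed future, with a base case. The second has the same shape minus the base case, and as far as we could tell it compiles without an `unconditional_recursion` warning, because the recursive call lives inside the async block rather than in the function body the lint inspects.

```rust
use std::future::Future;
use std::pin::Pin;

// Recursion through a boxed future: each level allocates its own future,
// so the type does not become infinitely large.
fn sum_to(n: u64) -> Pin<Box<dyn Future<Output = u64> + Send>> {
    Box::pin(async move {
        if n == 0 {
            0
        } else {
            n + sum_to(n - 1).await
        }
    })
}

// Same shape without a base case. Calling it only builds a future;
// awaiting that future recurses until the stack runs out.
fn forever() -> Pin<Box<dyn Future<Output = ()> + Send>> {
    Box::pin(async move { forever().await })
}
```

Awaiting `sum_to(4)` yields 10. `forever` is there only to show what slips past the compiler; it should never be awaited.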

If you’re curious, here’s the general idea behind how `async_trait` (the crate) achieves async traits (the behavior). It takes a function that returns `impl Future<Output = …>` and rewrites it to return `Pin<Box<dyn Future<Output = …>>>`. The boxed trait object is a concrete, sized type, so Rust allows it as the return type of a function inside traits.

(`Pin<Box<...>>` just pins the internal value to a specific location in memory, and prevents it from moving)

A function that returns an `impl Future`:
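A minimal illustration, using a hypothetical function `answer`. This is roughly what `async fn answer() -> u32 { 42 }` means: the return type is some anonymous future type that only the compiler can name.

```rust
use std::future::Future;

// Returns an opaque future; the caller only knows it yields a `u32`.
fn answer() -> impl Future<Output = u32> {
    async { 42_u32 }
}
```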

A function that returns a `dyn Future`:
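Again a minimal illustration with a hypothetical function, `answer_boxed`. The future is moved to the heap and hidden behind a trait object, so the return type is an ordinary pointer-sized handle regardless of what the async block contains. The `+ Send` bound mirrors what `async_trait` adds by default.

```rust
use std::future::Future;
use std::pin::Pin;

// Returns a heap-allocated, type-erased future that yields a `u32`.
fn answer_boxed() -> Pin<Box<dyn Future<Output = u32> + Send>> {
    Box::pin(async { 42_u32 })
}
```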

However, it’s clear that this is something that was overlooked while writing as well as reviewing our code. Got to love a language that makes you trust it so much.

It's a compiler not a Jedi, don't expect it to read minds.

So, when we talk about Rust being very secure, there comes a point where one must ponder just how much responsibility we can leave to the compiler. Even with all the brilliant features that the compiler provides, there are always some things that it might not catch. This is where the precision of the developer and the keen eye of the reviewer matter the most.

