Once upon the summer of 2019, we had a problem. And no, it wasn’t just that our name was still the far nerdier “Dataminer” (thankfully, we quickly fixed that).

You see, at Salv we had just pivoted from a fraud platform to an AML (anti-money laundering) platform, but we had the same problem every new startup had. We didn’t have customers. And, well, to find customers we needed to show them our product. And to show them our product, we needed data. And where the heck were we going to find data just lying around?

That’s when, as data scientists, we had a dream. A dream to have a large enough, diverse enough, complex enough, and fresh enough dataset that we could demo our product to potential customers and, ultimately, find the kinds of criminal networks that would enable us to test our software and try out real-life transaction monitoring scenarios.

And, well, to do that, we needed data somehow.

How did you get data?

To get data to show prospective customers, we had a few options:

  1. Use real bank data. While this would have been a dream, in real life, it’s never an option. Sad for us data nerds.
  2. Use the prospective customer’s data in hashed format. That would require a bunch of work on behalf of the customer upfront just to see our product, and we didn’t think they’d be likely to agree. Plus, if it was a small fintech that hadn’t launched, they wouldn’t even have data to offer. And, regardless, hashing is hard work.
  3. Use real-life non-bank data and make it into bank transactions. This could be possible. For example, I found a huge set of real life New York taxi data with persons, transaction amounts, start times, end times, etc. That could work. We could use that data as the base and then build bank-type transaction information on top. While not a bad idea, the problem came in when we started thinking about the fact that all the datasets, including taxi data, were finite. If we wanted to scale the data from 10,000 rows to 3,000,000 rows, it would be tough. Sure, we could do a little swapping and duplicating magic, but it wouldn’t be the same.
  4. Create bank-like data ourselves. As our team of data scientists talked more and more, we realized there were some basic algorithms we could use to generate life-like bank-like transactions. A base set of naming conventions. Behavioural rules. A population registry that resembled the real world. Once we set it up, it would continue generating life-like bank data on its own. It could scale.

Which option did you pick?

Well, the title probably gave our choice away.

In order to protect the innocent, in R, we created a realistic world of people, businesses, transactions, and crooks — none of which ever existed. Well, except for the occasional sanctioned entity. And that’s how Fakebank was born. Full of all kinds of fake transactions, people, companies, and crime, Fakebank is exactly what we data scientists at Salv were dreaming of.

Officially, Fakebank is a database of artificially-generated financial transactions. Unofficially, if I do say so myself, it’s far more interesting.

Last summer, when we launched Fakebank, it had 0 customers. As I write today, Fakebank has 102,000, and that number is growing. But we’ve built it so that we can expand to something like 3 million in a matter of days. We just haven’t seen the need. Yet.

If you’re a data geek like me, then get excited, because I’m going to tell you all the nitty gritty details of how we created it. If you’re not a data geek, well, I think you still may find the process interesting. Or, at the very least, you can impress your math nerd friends the next time you have lunch.

What services does Fakebank offer?

To be frank, Fakebank isn’t exactly a high street bank — its offerings are a bit sparse. At the moment, Fakebank has the following financial services implemented:

But, if you’re a prospective client worried that Fakebank is too simple, we can easily add a laundry list of Fakebank services like debit card payments and loans if needed. At least, we can add services way easier than you can launch them in real life. No offense.

Who uses Fakebank?

As you might have suspected, Fakebank operates in Fakenation as a part of Fakeworld. As an official member state, Fakenation has an individual population registry and a business registry. And it is to this set of fake people and fake businesses in its fake nation that Fakebank markets its financial products and services.

The good news is that Fakebank has some pretty good marketing. Which is why, once an hour, a bunch of people and businesses from those population and business registries are so impressed by Fakebank’s adverts that they become Fakebank customers.

Voila! Converted customers. Wish it was that easy in real life, eh?

Fakenation’s individual and business registries

If we pull back the curtains for a second, you’ll see that when we started this project, we combined our data science nerd power, a lot of googling, and our nearest friends in the financial industry. Which is how we came up with a base list of 44 individuals and 23 business types.

Fun fact, if you look through the names and addresses, you’ll see they often have an Estonian flavour. Or have a very not Estonian name but happen to live in Estonia in a town that doesn’t actually exist. Go figure for a team of 5 Estonian data scientists.

[Image: Apparently “Kacy Wunsch”, definitely not an Estonian name, happily lives in Estonia.]

From there, we added a really long list of various characteristics, and then options for each characteristic. Top it off with a random name generator and, boom, you have the basis of Fakenation’s individual and business registries.
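
To make that recipe concrete, here is a minimal sketch of building a registry from seed types, characteristic options and a random name generator. It is written in Python, although Fakebank itself was built in R, and everything in it (the seed types, the option lists, the name parts, the made-up towns and the `generate_registry` function) is an assumption for illustration rather than Fakebank’s actual seed data.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical seed types with a few characteristics and their options.
# The real Fakebank base lists are much longer and are not reproduced here.
SEED_TYPES = {
    "employee": {"age_band": ["20-29", "30-39", "40-49"], "monthly_tx": [20, 40, 60]},
    "student": {"age_band": ["18-24"], "monthly_tx": [10, 25]},
    "pensioner": {"age_band": ["60-69", "70+"], "monthly_tx": [5, 15]},
}
FIRST_NAMES = ["Kacy", "Sly", "Mari", "Jaan"]
LAST_NAMES = ["Wunsch", "Therin", "Tamm", "Saar"]
TOWNS = ["Fakelinn", "Nowherekula"]  # invented town names


@dataclass
class Person:
    person_id: int
    name: str
    town: str
    seed_type: str
    age_band: str
    monthly_tx: int


def generate_registry(size: int, seed: int = 0) -> List[Person]:
    """Build a registry by picking a seed type, then one option per characteristic."""
    rng = random.Random(seed)  # seeded so the same registry can be regenerated
    registry = []
    for person_id in range(size):
        seed_type = rng.choice(sorted(SEED_TYPES))
        options = SEED_TYPES[seed_type]
        registry.append(Person(
            person_id=person_id,
            name=f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
            town=rng.choice(TOWNS),
            seed_type=seed_type,
            age_band=rng.choice(options["age_band"]),
            monthly_tx=rng.choice(options["monthly_tx"]),
        ))
    return registry


if __name__ == "__main__":
    for person in generate_registry(3, seed=1):
        print(person)
```

Because each entry is just a seed type plus one random pick per characteristic, the registry can grow to any size by raising `size`, which is the property that makes this approach scale.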

As mentioned before, it’s from these registries that Fakebank “converts” entities into customers.

The reasons we decided to start our dataset with registries as our base are many:

  1. We can set up complex shareholder networks for businesses.
  2. We can use the registries to populate data for multiple banks.
  3. We can generate a network of transactions between banks.
  4. We can create user networks between banks, because banks X and Y can also be set up to convert entities from the same register.
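
The last point can be sketched in a few lines: if two banks convert customers from the same register, some entities will end up at both, and those shared entities link the two customer bases. This is an illustrative Python sketch, not Fakebank’s R code. The register size, the batch sizes and the `convert_customers` helper are all assumptions.

```python
import random
from typing import List, Set


def convert_customers(registry_ids: List[int], n: int, rng: random.Random) -> Set[int]:
    """Pick n distinct registry entries to become customers of one bank."""
    return set(rng.sample(registry_ids, n))


rng = random.Random(42)
registry_ids = list(range(1_000))  # one shared register (hypothetical size)
bank_x = convert_customers(registry_ids, 300, rng)
bank_y = convert_customers(registry_ids, 300, rng)

# Entities that are customers of both banks connect the two customer bases.
shared = bank_x & bank_y
print(f"{len(shared)} entities are customers of both banks")
```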

Fakebank’s customer profiles

Just like in real life, each person and business from those Fakenation registries is unique. Thus, each Fakebank customer has their own set of data that makes up that person or business. For example: Sly Therin, private person, 30-39 years old, employee, averages 40 transactions per month, minimum outgoing 1€, maximum outgoing 3000€, minimum incoming 1600€, maximum incoming 2000€.
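
As a rough illustration, the Sly Therin example could be held in a small record like the one below. This is a Python sketch under assumptions, not Fakebank’s R implementation: the field names and the `sample_amount` method are hypothetical, and drawing amounts uniformly between the limits is a simplification chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerProfile:
    name: str
    customer_type: str
    age_band: str
    occupation: str
    avg_tx_per_month: int
    min_outgoing_eur: float
    max_outgoing_eur: float
    min_incoming_eur: float
    max_incoming_eur: float

    def sample_amount(self, direction: str, rng: random.Random) -> float:
        """Draw an amount inside the profile's limits (uniform draw is an assumption)."""
        if direction == "outgoing":
            low, high = self.min_outgoing_eur, self.max_outgoing_eur
        elif direction == "incoming":
            low, high = self.min_incoming_eur, self.max_incoming_eur
        else:
            raise ValueError(f"unknown direction: {direction!r}")
        return round(rng.uniform(low, high), 2)


# Values taken from the Sly Therin example in the text.
sly = CustomerProfile(
    name="Sly Therin",
    customer_type="private person",
    age_band="30-39",
    occupation="employee",
    avg_tx_per_month=40,
    min_outgoing_eur=1,
    max_outgoing_eur=3000,
    min_incoming_eur=1600,
    max_incoming_eur=2000,
)

if __name__ == "__main__":
    rng = random.Random(0)
    print(sly.sample_amount("outgoing", rng), sly.sample_amount("incoming", rng))
```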

Fakebank customer profiles have data on:

Fakebank customer bank accounts

Once an individual or business succumbs to the marketing ploys of Fakebank and signs up to become a Fakebank customer, we automatically assign them anywhere from 1 to 5 bank accounts inside Fakebank, along with a corresponding currency tied to each specific account. Just like a real human.
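
A minimal sketch of that assignment step might look like this. It is written in Python for illustration (Fakebank itself is in R), and the currency list, the account ID format and the `open_accounts` function are assumptions.

```python
import random
from dataclasses import dataclass
from typing import List

CURRENCIES = ["EUR", "USD", "GBP", "SEK"]  # hypothetical list of supported currencies


@dataclass
class Account:
    account_id: str
    customer_id: int
    currency: str


def open_accounts(customer_id: int, rng: random.Random) -> List[Account]:
    """Give a new customer between 1 and 5 accounts, each tied to one currency."""
    count = rng.randint(1, 5)  # randint includes both endpoints
    return [
        Account(
            account_id=f"FB{customer_id:06d}-{index}",  # made-up ID format
            customer_id=customer_id,
            currency=rng.choice(CURRENCIES),
        )
        for index in range(1, count + 1)
    ]


if __name__ == "__main__":
    for account in open_accounts(17, random.Random(3)):
        print(account)
```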

Today, Fakebank converts around 300 new customers per day, but that will increase gradually over time.

Generating the data

As mentioned, Fakebank impressively manages to convert more customers, individuals and businesses alike, once per hour. These customers then, obviously, generate their own transactions. Transaction data is set to generate every minute.

Currently as I write, Fakebank has around 9,000 transactions per hour and 80,000 transactions a day. Not bad growth for a startup bank, eh?
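
The two schedules, hourly conversion and per-minute transaction generation, can be sketched as a simple loop over the minutes of a day. This is an illustrative Python sketch, not the scheduler Fakebank uses. The function name and all parameter values are assumptions and are not tuned to reproduce the figures above.

```python
import random
from typing import Tuple


def simulate_day(registry_size: int,
                 converts_per_hour: int,
                 tx_chance_per_minute: float,
                 seed: int = 0) -> Tuple[int, int]:
    """Return (customers converted, transactions generated) after one simulated day."""
    rng = random.Random(seed)
    unconverted = list(range(registry_size))
    rng.shuffle(unconverted)
    customers = []
    tx_count = 0
    for minute in range(24 * 60):
        if minute % 60 == 0:
            # Once an hour, convert a batch of registry entries into customers.
            batch_size = min(converts_per_hour, len(unconverted))
            customers.extend(unconverted.pop() for _ in range(batch_size))
        # Every minute, each existing customer may generate a transaction.
        tx_count += sum(1 for _ in customers if rng.random() < tx_chance_per_minute)
    return len(customers), tx_count


if __name__ == "__main__":
    converted, transactions = simulate_day(1_000, 12, 0.01, seed=3)
    print(converted, "customers,", transactions, "transactions")
```

Since customers only ever accumulate, the number of transactions per minute rises through the day, which mirrors how a growing customer base drives growing transaction volume.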

How do you spot suspicious activity and entities?

If all of Fakebank’s customers were completely innocent, then you and I probably wouldn’t care. And, well, that would be pretty boring. Especially if you’re testing an AML platform like our potential customers are.

So, good news! Or, rather, bad news — just like in real life, Fakebank has a bunch of shady characters and transactions happening all the time. At random intervals, shady transactions are generated inside Fakebank. Usually every day a few SAR (suspicious activity report) cases are created.

Defining suspicious activity

To define what was suspicious, a group of people inside Salv sat down and put together a list of scenarios that would be considered suspicious activity. It’s a long list, but, for example, it might mean quickly emptying an account after a large sum is received. Or a customer who normally transacts in smaller amounts suddenly sends a massive amount. Or even someone receiving an unusually high amount from someone in a high-risk country.
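
As an illustration of the first scenario, quickly emptying an account after a large sum is received, here is a minimal rule sketched in Python. It is not one of Salv’s actual scenarios. The `Tx` record, the function name and the thresholds (10,000 as “large”, a 24-hour window, 90% drained) are all assumptions picked for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Tx:
    account_id: str
    timestamp: datetime
    amount: float  # positive = incoming, negative = outgoing


def flags_rapid_emptying(txs: List[Tx],
                         large_incoming: float = 10_000.0,
                         window: timedelta = timedelta(hours=24),
                         drained_share: float = 0.9) -> bool:
    """True if most of a large incoming sum leaves the account within the window."""
    ordered = sorted(txs, key=lambda t: t.timestamp)
    for i, tx in enumerate(ordered):
        if tx.amount < large_incoming:
            continue  # not a large incoming payment
        outgoing = sum(
            -later.amount
            for later in ordered[i + 1:]
            if later.amount < 0 and later.timestamp - tx.timestamp <= window
        )
        if outgoing >= drained_share * tx.amount:
            return True
    return False


if __name__ == "__main__":
    start = datetime(2020, 1, 1, 9, 0)
    history = [
        Tx("FB-1", start, 15_000.0),
        Tx("FB-1", start + timedelta(hours=2), -14_000.0),
    ]
    print(flags_rapid_emptying(history))
```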

The more fun part is that we made sure that around 10% of the cases would set off some sort of alert to check, but then, when you dig deeper, you’ll find the activity is quite normal. The real criminals are less than 1%.

Sanctioned counterparties

Sanctions screening is the one scenario in which Fakebank doesn’t use fake data, because we deemed real entries vital for testing the efficacy of KYC procedures. That’s why, at random intervals, Fakebank customers, counterparties and business shareholders are drawn from the Dow Jones’ Sanctioned Persons database, giving the screening something to check against.

Counterparties

As mentioned, Fakebank is a part of Fakenation. And Fakebank customers all come from Fakenation’s individual and business registries.

However, the same isn’t necessarily true of customer counterparties. If the Fakebank customer is involved with a counterparty from outside of the Fakenation registry, then it is generated from a separate registry of Fake-ForeignNation Counterparties. And, as you might imagine, different fake foreign nations also hold different levels of risk. Smart, right?

Currently, around 90% of Fakebank transactions are generated using connections with entities listed in Fakenation’s individuals registry. About 9% are generated with a random counterparty selected from the counterparty database.

About 1% of transactions have a completely random counterparty (generated at the moment of the transaction).
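
That 90 / 9 / 1 split is a weighted random choice. Here is a minimal Python sketch of it (Fakebank itself is in R); the source labels and the `pick_counterparty_source` function are hypothetical names.

```python
import random
from collections import Counter

# Hypothetical labels for the three counterparty sources described above.
SOURCES = ["registry_connection", "counterparty_database", "generated_on_the_fly"]
WEIGHTS = [0.90, 0.09, 0.01]


def pick_counterparty_source(rng: random.Random) -> str:
    """Choose where a transaction's counterparty comes from, using the stated split."""
    return rng.choices(SOURCES, weights=WEIGHTS, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(2020)
    counts = Counter(pick_counterparty_source(rng) for _ in range(10_000))
    for source in SOURCES:
        print(source, counts[source])
```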

Built-in complexities and networks

Life is complex — even in Fakeworld. Which is why, in order to mimic real life situations, we built some complexities into the data:

[Image: A few small transactions and then… a bit bigger one.]

What’s in a transaction?

Finally, you’ll probably want to know about the actual transactions. We did our best to make them resemble real ones — along with auxiliary tables like account details, account balance, credit card details etc.

But, as we mentioned earlier, Fakebank doesn’t have a complex product offering. So these transactions are modeled off of a commercial bank that offers these types of limited services.

Included transaction information

For every transaction you’ll find in Fakebank, you’ll also find all of this basic information:

How do you generate individual and business user datasets?

Non-data people, this is a little boring, so you can probably skip to the next section. But if you want the details, here’s the order:

Generating individual user transactions

Salv’s seed data for individuals → Data is then generated into Fakenation’s individuals registry → Individual becomes a Fakebank customer → Transactions are generated according to the customer’s individual profile or the customer’s connections.

Generating business transactions

Salv’s seed data for businesses → Data is then generated into Fakenation’s business registry (with some UBOs from the individuals registry) → Business then becomes a Fakebank customer → Then transactions are generated according to that business profile

How have you benefited from generating your own data?

When we began building Fakebank, our use case was rather narrow. As time went on, we realized there were some massive benefits to having our very own fake customer.

[Image: Several teams of Salvers working on being an AML agent for a day.]

That’s just an overview

All right all right. Fascinating details over.

But, believe it or not, this really is just a quick overview. If you’d like to try out Fakebank, learn more about how it works, or think you might want to use the dataset, then get in touch directly with me: [email protected]

This is what gets me up in the morning.