Why Fake Data is Useful for Data Engineers
One thing that can be surprisingly difficult as a Data Engineer is getting hold of data to work with. There are many reasons you might need fake data: learning a new tool, slow access to data sources, and PII restrictions can all mean you can't, or don't want to, use "real" data.
While creating a demo recently, I needed a good number of tables with large amounts of data that did not contain any real customer data. When researching how to do this as quickly as possible, I found the Faker Python package. Faker is built for this exact purpose and it's incredibly easy to use.
Getting Started with Faker
Once you have installed the Faker package you then need to import it and create an instance of Faker:
You can then generate some fake data. Let's start by generating a fake person.
Running this will produce something different every time, but the result might look something like this:
If you want the result to be predictable you can "seed" the result:
This will always produce the same result:
It's unlikely that you would want to just generate one person though so let's make a function that we can build from:
Note that in this function I am returning the results as key value pairs. This is because I am going to create JSON files but you can return the data in any format you want.
Using Faker’s Localisation for More Realistic Data
Now we have a function that can generate a fake person; however you may have noticed that the different fields don't have any relation to each other. For example, the email address is not related to the name and the city doesn't seem like it could be a city in that country. This may be fine for your purposes but if you want the data to appear more real you can use a feature of Faker called localisation.
Localisation allows you to create a faker instance with a specific locale and tailor what is created to that locale. For example to create a fake person with the Great Britain locale:
It's also possible to set up multiple locales and have one picked at random, which keeps each record internally consistent while adding variety across records. To do this I set up Faker instances for the USA (en_US), France (fr_FR), Germany (de_DE), Great Britain (en_GB), Italy (it_IT) and Spain (es_ES). At the start of the function, one of these instances is chosen at random:
Generating Fake Data Files
Now we want to build on this code so that it generates files of data. This requires another function, which calls the function we created previously a set number of times and writes the output to a JSON file. Making the file name, output directory, number of files and rows per file parameters of this function keeps it flexible. I'm going to create JSON files, but this can easily be changed to the format of your choice with a bit of experimentation.
In this example I set the number of files to 10 and the rows per file to 100,000. The function ran in 1 minute and 12 seconds. That's a million rows of fake data in just over a minute!
Modifying Faker for Realistic Database Models
So, taking this example of fake people, how can it be modified to reflect data that we see all the time in a database, such as an employee or customer table? Let's generate data for a customer table:
This method can also be modified for any sort of table you want. Take for example a transactions table:
By changing the code just a little, we can generate a whole different set of data.
Customer id is included in these files, and a minimum and maximum can be specified. This means that if you have generated fake customer data, you can generate matching transactions so the transaction data can be joined to the customer data. Product id is also included, so you could create fake product data and have matching data there too. By joining quantity from transactions with unit price from a product table, you can calculate revenue.
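To make the revenue calculation concrete, here is a small sketch using hand-written illustrative rows rather than generated ones. The column names are assumptions, and prices are held as whole pence to avoid floating-point rounding.

```python
# Illustrative rows standing in for generated product and transaction data
products = [
    {"product_id": 1, "unit_price_pence": 999},
    {"product_id": 2, "unit_price_pence": 2450},
]
transactions = [
    {"transaction_id": "t1", "customer_id": 7, "product_id": 1, "quantity": 2},
    {"transaction_id": "t2", "customer_id": 7, "product_id": 2, "quantity": 1},
]

# Build a lookup so each transaction can be joined to its product
price_by_product = {p["product_id"]: p["unit_price_pence"] for p in products}

# Revenue is quantity multiplied by unit price, summed over all transactions
revenue_pence = sum(
    t["quantity"] * price_by_product[t["product_id"]] for t in transactions
)
print(revenue_pence)  # 4448
```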
Pretty soon you can have a whole data model of fake data!
Conclusion
I hope I have shown how easy it is to get started using Faker and generating limitless amounts of fake data quickly. Other uses for Faker include: testing new products, creating data based on your real schemas for dev or integration testing, and creating massive amounts of data for load testing. Try it out today!
Author
Ed Oldham