GenAI and Data Democratization, Part 2

GenAI will allow anyone to access data using plain English

Jun 27, 2024

Introduction

Most of the GenAI buzz is around unstructured data, but the opportunities with other kinds of data are just as exciting. Last week, I talked about using GenAI to clean and structure data. This week, I’m going to talk about how you can use GenAI to search your data using natural language, which really feels like magic.

How GenAI can make data available to anyone

The dream for data nerds inside any organization is that everyone will use their data to make decisions. The problem is that most data is hard to access: it’s in a giant data lake and only available through SQL or Python. The few people who understand the database become the gatekeepers (and the bottlenecks).

At Pyxis, Bain’s alternative data platform, we deal with the same issue. A team asks for data about a company (e.g., market share, basket size, cross-shopping). Then, one of our Pyxis experts writes a SQL query to pull the data from the data lake (which already holds structured data, as discussed last week). The data is then either sent to the team or crunched by Pyxis or our Advanced Analytics team. It typically takes 24-48 hours to go from request to data, and then another 6-12 hours to crunch the data correctly.

That’s quite fast for such complex data, but it is limiting. If I’m in a meeting with a client, and a question comes up, I can’t pull up data about their performance in real time if I haven’t anticipated the need in advance. The dream is for a consultant to be able to ask the bot for data in plain language, and immediately get it back. That dream is pretty close to reality right now.

Here’s an example of what we’re calling Pyxis GPT:

Which one minute later gets us

So, what’s going on here? You can see the original question is “What’s been the average order value at Lululemon each year from 2019-2023 compared to the overall retail category?” The next thing that happens is that the bot clarifies what I’m looking for within retail.

The clarification step is important. If the bot just assumes that it knows what I want and starts writing a query, it could be wrong, and I might not notice. Additionally, the bot learns my preferences over time, so it won’t ask the same questions over and over. For example, when I asked about average order size, it used to ask whether I meant number of items or dollar amount. After I answered “dollar amount” consistently 4-5 times, it stopped asking me (which you can see above).
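To make the “stops asking after a few consistent answers” behavior concrete, here is a minimal sketch of one way it could work. This is not how Pyxis GPT is actually built; the `PreferenceMemory` class, its method names, and the threshold of 4 are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: consistent answers needed before the bot stops asking.
ASK_UNTIL_CONSISTENT = 4


class PreferenceMemory:
    """Remembers how a user resolved each ambiguous term (illustrative only)."""

    def __init__(self, threshold: int = ASK_UNTIL_CONSISTENT):
        self.threshold = threshold
        self.answers = defaultdict(Counter)  # term -> counts of chosen meanings

    def record(self, term: str, meaning: str) -> None:
        """Store the meaning the user picked when the bot asked."""
        self.answers[term][meaning] += 1

    def resolve(self, term: str):
        """Return the learned meaning, or None if the bot should still ask."""
        counts = self.answers.get(term, Counter())
        if len(counts) == 1:  # the user has always given the same answer
            meaning, times = next(iter(counts.items()))
            if times >= self.threshold:
                return meaning
        return None


memory = PreferenceMemory()
for _ in range(4):
    memory.record("average order size", "dollar amount")

# After four consistent answers the bot can skip the clarifying question.
learned = memory.resolve("average order size")
```

In this sketch a single inconsistent answer makes the bot start asking again, which errs on the side of clarifying rather than guessing.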

Once the bot knows what I want (the text in bold above), and I tell it to start, it will write a SQL query and return the answer as raw data and as a chart. It takes about 1-2 minutes for the results to come back. That is much, much faster than a human (24-48 hours) but not yet conversational. We’re hoping that future improvements in model performance get us to the point where you can chat with data and get instant results.
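For a sense of what the generated query might look like, here is a rough sketch of a yearly average order value comparison between one brand and its category. The table name, column names, and rows are all made up for illustration (they are not Pyxis data or the real schema), and the query runs against an in-memory SQLite database so it is easy to try.

```python
import sqlite3

# Made-up toy rows, purely illustrative: (brand, category, year, order_total)
rows = [
    ("Brand A", "apparel", 2022, 100.0),
    ("Brand A", "apparel", 2022, 140.0),
    ("Brand B", "apparel", 2022, 60.0),
    ("Brand A", "apparel", 2023, 130.0),
    ("Brand B", "apparel", 2023, 70.0),
    ("Brand B", "apparel", 2023, 80.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (brand TEXT, category TEXT, year INTEGER, order_total REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# AVG ignores NULLs, so the CASE expression averages only the chosen brand's orders,
# while the plain AVG covers every order in the category.
QUERY = """
SELECT year,
       AVG(CASE WHEN brand = :brand THEN order_total END) AS brand_aov,
       AVG(order_total) AS category_aov
FROM orders
WHERE category = :category
  AND year BETWEEN :start_year AND :end_year
GROUP BY year
ORDER BY year
"""


def average_order_value(conn, brand, category, start_year, end_year):
    """Return (year, brand AOV, category AOV) tuples, one per year."""
    params = {
        "brand": brand,
        "category": category,
        "start_year": start_year,
        "end_year": end_year,
    }
    return conn.execute(QUERY, params).fetchall()


result = average_order_value(conn, "Brand A", "apparel", 2022, 2023)
for year, brand_aov, category_aov in result:
    print(year, round(brand_aov, 2), round(category_aov, 2))
```

The parts the bot has to get right are exactly the ones the clarification step pins down: which category to compare against, which years, and what “order value” means.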

Here’s another example:

The data shows that Culver’s tends to have more loyal customers than the other brands. An executive at Firehouse, for example, could further cut this by city or by demographics to try to understand why they have fewer repeat visitors and what can be done about it.

Conclusion

So, we are not far from a world where anyone in a company can type questions in natural language and get answers from either the company’s internal (1P) data or third-party (3P) data. This has a lot of exciting implications:

Data-driven decisions get easier: If everyone can get numbers near instantly, hopefully more decisions will be made with data.
The end of data gatekeepers: The role of data teams may shift dramatically toward acquiring data and understanding its nuances. They will no longer be the bottleneck to getting answers.
Data structuring becomes more important: Getting data into a format where GenAI queries will work becomes a higher priority when more people are using the data. The good news is that structuring the data is easier than ever (see part 1).
Risk of misinterpreting the data: Unfortunately, the downside of this data democratization is that people accessing the data directly may not get the wisdom of the people who understand the nuances.
- A classic example is a retailer whose online transactions all show up as coming from a single location in New Jersey. The Pyxis team knows this quirk and can adjust for it. If people pull the data directly in the future, we will need to fix quirks like this in the underlying data, but there is still a serious risk that such caveats don’t reach the people using the data, and they draw the wrong conclusions.
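As a rough illustration of fixing a quirk like the New Jersey one in the underlying data, the sketch below relabels orders booked at a placeholder store as online orders, so they are not counted as in-store sales in that state. The store ID, field names, and sample transactions are all hypothetical, not the actual Pyxis data model.

```python
from dataclasses import dataclass, replace

# Hypothetical ID for the single location where all online orders are booked.
ONLINE_PLACEHOLDER_STORE = "NJ-0001"


@dataclass(frozen=True)
class Transaction:
    store_id: str
    state: str
    amount: float
    channel: str = "in_store"


def tag_online_orders(transactions):
    """Relabel orders booked at the placeholder store as online, with no physical state."""
    fixed = []
    for t in transactions:
        if t.store_id == ONLINE_PLACEHOLDER_STORE:
            t = replace(t, channel="online", state="N/A")
        fixed.append(t)
    return fixed


# Made-up sample: two orders booked at the placeholder store, one at a real store.
raw = [
    Transaction("NJ-0001", "NJ", 85.0),
    Transaction("NJ-0001", "NJ", 40.0),
    Transaction("TX-0042", "TX", 60.0),
]
cleaned = tag_online_orders(raw)

# Sales by state now exclude the online orders that only appeared to be in New Jersey.
nj_in_store_sales = sum(
    t.amount for t in cleaned if t.state == "NJ" and t.channel == "in_store"
)
```

Baking the adjustment into the data this way means someone querying in plain English gets the corrected view without needing to know the quirk exists.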

Hopefully, these posts spark some ideas for how GenAI can help you structure your data and enable your companies to better access and get value from it.
