- Jul 25, 2024
- 5 minutes
- Yohann Doillon
- Mostafa Ajallooeian
- Jan Słowik
- Hanna Jarlaczyńska
Many businesses face a critical challenge when adopting new technologies: dealing with legacy code. Such code is often difficult to maintain, and rewriting it in a new technology is usually both time-consuming and resource-intensive. With the advent of Generative AI, old code can now be translated from an outdated language into a more modern and performant one, allowing businesses to migrate code across stacks with minimal resources. In this article, we explore three approaches that leverage generative AI to migrate old SQL queries to Python code, as described in the following video.
Approach #1 - SQL to Spark SQL
As a first approach, we attempted to migrate SQL code from an Oracle database to Spark SQL code. The GPT-3.5 model, when used with the right prompts, provided good results for simple SQL queries with little human intervention needed. However, it struggled with complex queries, such as recursive common table expressions, and it showed limitations on the length of code it could handle. When we moved to the newer GPT-4 model, the quality of the results improved significantly, but the limit on the length of code that could be translated remained.
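Getting "the right prompts" matters in practice. A minimal sketch of how such a translation prompt could be assembled (the function name, instruction wording, and sample query are illustrative, not the exact prompts we used):

```python
def build_translation_prompt(oracle_sql: str) -> str:
    """Assemble an LLM prompt asking for an Oracle SQL -> Spark SQL translation."""
    return (
        "Translate the following Oracle SQL query to Spark SQL.\n"
        "Replace Oracle-specific functions (e.g. NVL, DECODE) with Spark SQL\n"
        "equivalents, preserve all column names and aliases, and return only\n"
        "the translated query.\n\n"
        f"{oracle_sql}"
    )

# Example: a query using the Oracle-specific NVL function.
prompt = build_translation_prompt(
    "SELECT employee_id, NVL(bonus, 0) AS bonus FROM employees"
)
```

The resulting string would then be sent to the chat completion endpoint; constraining the model to preserve column names and aliases becomes especially important once queries are split into parts, as in Approach #3.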
Approach #2 - SQL to PySpark
In a second attempt, we focused on translating SQL code to PySpark, as Python provides more flexibility in finding solutions. As in the first approach, the model performed well with simple queries but struggled with more complex ones, again due to limitations on the context length.
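One way to steer the model toward idiomatic DataFrame code is to include a hand-written SQL-to-PySpark pair in the prompt. A minimal few-shot sketch (the example pair and helper are our illustration, not the exact prompts used):

```python
# A single hand-written translation pair used as a few-shot example
# (illustrative; any correct SQL/PySpark pair would serve).
EXAMPLE_SQL = "SELECT name FROM users WHERE age > 30"
EXAMPLE_PYSPARK = 'df_users.filter(F.col("age") > 30).select("name")'

def few_shot_prompt(sql: str) -> str:
    """Embed one worked example before the query to translate."""
    return (
        "Translate SQL to PySpark DataFrame code.\n\n"
        f"SQL: {EXAMPLE_SQL}\n"
        f"PySpark: {EXAMPLE_PYSPARK}\n\n"
        f"SQL: {sql}\n"
        "PySpark:"
    )

prompt = few_shot_prompt("SELECT customer_id FROM orders WHERE amount > 100")
```

The worked example nudges the model toward the `F.col(...)` DataFrame style rather than embedded SQL strings, at the cost of a few extra tokens per request.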
It is worth noting that some important considerations remain with such an approach, such as the risk of exceeding the model's context limit, waiting times for model responses, and the cost of usage. Yet, we can expect future releases of large language models to handle longer queries as their context windows increase.
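A cheap pre-flight check can flag queries that will not fit before any tokens are paid for. A sketch using the rough ~4 characters-per-token heuristic (the window and reserve sizes below are placeholders, not the limits of any specific model):

```python
def fits_context(sql: str, window_tokens: int = 16_000,
                 reserved_for_output: int = 2_000) -> bool:
    """Estimate whether a query fits the model's context window.

    Uses the common ~4 characters-per-token approximation; a production
    pipeline would count tokens with the model's actual tokenizer.
    """
    estimated = len(sql) // 4 + 1
    return estimated <= window_tokens - reserved_for_output

fits_context("SELECT * FROM orders")          # small query: fits
fits_context("SELECT 1 UNION ALL " * 10_000)  # ~190k chars: does not fit
```

Queries that fail this check are exactly the candidates for the decomposition strategy described next.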
Approach #3 - SQL to PySpark (Divide & Conquer)
In a third attempt, we decided to break down large, complex SQL queries into smaller parts using abstract syntax trees and build larger queries from the partial results. This approach provided greater transparency to the developer, and made it easier to trace and correct errors. It also prompted more strategic business questions, such as questioning the need to maintain a monolithic view built from 3,000 (!) lines of code.
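Our implementation relied on a full SQL parser, but the idea can be illustrated with a toy splitter that separates top-level common table expressions from the final SELECT by tracking parenthesis depth (a simplification: it assumes uppercase `AS` and ignores string quoting, which a real abstract syntax tree handles properly):

```python
def split_ctes(sql: str):
    """Split 'WITH a AS (...), b AS (...) SELECT ...' into named subqueries
    plus the final query. Toy stand-in for an AST-based decomposition."""
    rest = sql.strip()[4:].strip()          # drop the leading 'WITH'
    ctes = {}
    while True:
        name, _, after = rest.partition(" AS ")
        start = after.index("(")
        depth, end = 0, start
        for i in range(start, len(after)):  # find the matching close paren
            if after[i] == "(":
                depth += 1
            elif after[i] == ")":
                depth -= 1
                if depth == 0:
                    end = i
                    break
        ctes[name.strip()] = after[start + 1:end].strip()
        tail = after[end + 1:].lstrip()
        if not tail.startswith(","):        # no more CTEs: tail is the SELECT
            return ctes, tail
        rest = tail[1:].strip()

parts, final_query = split_ctes(
    "WITH recent AS (SELECT * FROM orders WHERE year = 2024), "
    "totals AS (SELECT customer_id, SUM(amount) AS total "
    "FROM recent GROUP BY customer_id) "
    "SELECT * FROM totals"
)
# Each entry in `parts` can now be translated and verified independently.
```

Translating each named subquery on its own is what gives the developer a traceable unit to inspect when a partial result looks wrong.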
Importantly, this approach was significantly faster because smaller queries could be processed using smaller and faster models. Yet, we were also faced with incorrect query translations due to inconsistent use of aliases in the code, which caused issues with column ambiguity in Spark. The model was able to partially handle this problem, occasionally making minor mistakes that could be fixed with follow-up prompts focusing on the affected subquery. Interestingly, asking GPT-3.5 several times increased the chances of obtaining a good enough solution compared to waiting for a single GPT-4 response.
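The "ask the cheaper model several times" observation amounts to a best-of-n loop with a validation gate. A minimal sketch, where the `model_call` and `validate` callables stand in for the real API call and correctness check:

```python
def translate_with_retries(sql, model_call, validate, attempts=3):
    """Request up to `attempts` translations from a fast model and return
    the first candidate that passes validation, else None."""
    for _ in range(attempts):
        candidate = model_call(sql)
        if validate(candidate):
            return candidate
    return None

# Deterministic stand-ins for demonstration: the "model" fails once, then
# produces a translation free of ambiguous column references.
calls = {"n": 0}
def fake_model(sql):
    calls["n"] += 1
    return 'df.select("a")' if calls["n"] > 1 else "AMBIGUOUS: a"

result = translate_with_retries(
    "SELECT a FROM t", fake_model,
    validate=lambda code: "AMBIGUOUS" not in code,
)
```

In practice, `validate` could run the candidate against a small data sample or simply check that it parses; when all attempts fail, a targeted follow-up prompt on the offending subquery is the natural next step.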
In summary, these different GenAI approaches can provide significant support for developers struggling with the migration of legacy code. The preliminary results are extremely promising, with the resulting code allowing us to reproduce the logic embedded in SQL in a simple way. Beyond the technological advancements, we should emphasize the need for a human in the loop to mitigate the highlighted limitations, especially for the most complex queries.