Paul Gauthier 2023-05-23 05:55:07 -07:00
parent b508431766
commit b67ef10c27


@ -2,7 +2,8 @@
# Improving GPT-4's codebase understanding with ctags # Improving GPT-4's codebase understanding with ctags
GPT-4 is extremely useful for "self-contained" coding tasks, GPT-4 is extremely useful for "self-contained" coding tasks,
like generating brand new code or modifying a pure function without dependencies. like generating brand new code or modifying a pure function
that has no dependencies.
But it's difficult to use GPT-4 to modify or extend But it's difficult to use GPT-4 to modify or extend
a large, complex pre-existing codebase. a large, complex pre-existing codebase.
@ -37,35 +38,38 @@ class objects that are required to prepare for the test.
## The problem: code context ## The problem: code context
GPT-4 is great at "self contained" coding tasks, like writing or GPT-4 is great at "self contained" coding tasks, like writing or
modifying a pure function with no external dependencies. These work modifying a pure function with no external dependencies.
well because you can send GPT a self-contained request like "write a GPT can easily handle requests like "write a
Fibonacci function" or "rewrite the loop using list Fibonacci function" or "rewrite the loop using list
comprehensions". These changes require no context beyond the code comprehensions", because they require no context beyond the code
being discussed. being discussed.
Most real code is not pure and self-contained, it is intertwined with Most real code is not pure and self-contained, it is intertwined with
code from many different files in a repo. and depends on code from many different files in a repo.
If you ask GPT to "switch all the print statements in class Foo to If you ask GPT to "switch all the print statements in class Foo to
use the BarLog logging system", it needs to see the code in the Foo class use the BarLog logging system", it needs to see the code in the Foo class
with the prints, and it also needs to understand how the project's BarLog with the prints, and it also needs to understand the project's BarLog
logging system works. subsystem.
A simple solution is to **send the entire codebase** to GPT along with A simple solution is to **send the entire codebase** to GPT along with
each change request. Now GPT has all the context! But this won't work each change request. Now GPT has all the context! But this won't work
for even moderately for even moderately
sized repos that won't fit in the 8k-token context window. sized repos, because they won't fit into the 8k-token context window.
A better approach is to be selective, and **hand pick which files to send**. A better approach is to be selective,
and **hand pick which files to send**.
For the example above, you could send the file that For the example above, you could send the file that
contains Foo and the file that contains the BarLog logging subsystem. contains the Foo class
This works pretty well, and is supported by `aider`: you and the file that contains the BarLog logging subsystem.
This works pretty well, and is supported by `aider` -- you
can manually specify which files to "add to the chat". can manually specify which files to "add to the chat".
But it's not ideal to have to manually identify the right But it's not ideal to have to manually identify the right
set of files to add to the chat. set of files to add to the chat.
Some changes may need context from many files. And sending whole files is a bulky way to send code context,
And you might still overrun wasting the precious 8k context window.
the context window if you need to add many files for context. You may quickly run out of context window if you need to
send many files worth of context.
## Using a repo map to provide context ## Using a repo map to provide context
@ -113,7 +117,7 @@ Mapping out the repo like this provides some benefits:
- GPT can see variables, classes, methods and function signatures from everywhere in the repo. This alone may give it enough context to solve many tasks. For example, it can probably figure out how to use the API exported from a module just based on the details shown in the map. - GPT can see variables, classes, methods and function signatures from everywhere in the repo. This alone may give it enough context to solve many tasks. For example, it can probably figure out how to use the API exported from a module just based on the details shown in the map.
- If it needs to see more code, GPT can use the map to figure out by itself which files it needs to look at. GPT will then ask to see these specific files, and `aider` will automatically add them to the chat context (with user approval). - If it needs to see more code, GPT can use the map to figure out by itself which files it needs to look at. GPT will then ask to see these specific files, and `aider` will automatically add them to the chat context (with user approval).
Of course, for large repositories, even just their map might be too large Of course, for large repositories even just the map might be too large
for the context window. However, this mapping approach opens up the for the context window. However, this mapping approach opens up the
ability to collaborate with GPT-4 on larger codebases than previous ability to collaborate with GPT-4 on larger codebases than previous
methods. It also reduces the need to manually curate which files to methods. It also reduces the need to manually curate which files to
@ -149,10 +153,11 @@ For example, here is the `ctags --fields=+S --output-format=json` output for the
} }
``` ```
The repo map is built using this `ctags` data. The repo map is built using this type of `ctags` data,
Rather then sending the data to GPT using verbose json, `aider` formatting into the space
formats the map as a sorted, efficient hierarchical tree format shown above.
hierarchical tree. This is a format that GPT can easily understand and which efficiently conveys the map data using a This is a format that GPT can easily understand
and which conveys the map data using a
minimal number of tokens. minimal number of tokens.
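As a rough illustration of the idea above, the sketch below turns `ctags` JSON lines into a sorted, indented outline. It is not `aider`'s actual implementation: the function name `build_repo_map` and the exact output layout are assumptions made for this example, and it only relies on the `_type`, `name`, `path`, `kind`, `scope` and `signature` fields that the ctags JSON output includes when `--fields=+S` is used.

```python
import json
from collections import defaultdict


def build_repo_map(ctags_json_lines):
    """Hypothetical helper: fold ctags JSON lines into a compact outline.

    Each file becomes a header line, top-level symbols are indented once,
    and symbols with a "scope" (e.g. methods) are nested under their owner.
    """
    # path -> scope name ("" for top level) -> list of (name, label)
    files = defaultdict(lambda: defaultdict(list))

    for line in ctags_json_lines:
        line = line.strip()
        if not line:
            continue
        tag = json.loads(line)
        if tag.get("_type") != "tag":
            continue  # skip any non-tag records
        parts = (tag.get("kind"), tag["name"], tag.get("signature"))
        label = " ".join(part for part in parts if part)
        files[tag["path"]][tag.get("scope", "")].append((tag["name"], label))

    out = []
    for path in sorted(files):
        out.append(f"{path}:")
        scopes = files[path]
        printed = set()
        # Top-level symbols, each followed by the members scoped to it.
        for name, label in sorted(scopes.get("", [])):
            out.append(f"  {label}")
            for _, child in sorted(scopes.get(name, [])):
                out.append(f"    {child}")
            printed.add(name)
        # Scopes whose owner was not tagged at the top level of this file.
        for scope in sorted(s for s in scopes if s and s not in printed):
            out.append(f"  {scope}")
            for _, child in sorted(scopes[scope]):
                out.append(f"    {child}")
    return "\n".join(out)
```

Grouping by file and scope means each path and class name is written once instead of being repeated in every JSON record, which is where most of the token savings would come from.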
## Example chat transcript ## Example chat transcript
@ -169,6 +174,27 @@ Using only the meta-data in the map, GPT is able to figure out how to call the m
GPT makes one reasonable mistake writing the first version of the test, but is GPT makes one reasonable mistake writing the first version of the test, but is
able to quickly fix the issue after being shown the `pytest` error output. able to quickly fix the issue after being shown the `pytest` error output.
## Future work
Just as it was inefficient to send "the whole codebase" to GPT with
every request, there are probably better approaches than sending
"the whole repo map" with every request.
Sending a subset of the repo map would help `aider` work
better with even larger repositories which have large maps.
Some possible approaches to reducing the amount of map data are:
- Distill the global map further, to prioritize important symbols and discard "internal" or otherwise less globally relevant identifiers.
- Provide a mechanism for GPT to start with a distilled subset of the global map, and let it ask to see more detail about subtrees or keywords that it feels are relevant to the current coding task.
- Attempt to analyze the natural language coding task given by the user and predict which subset of the repo map is relevant, possibly by analyzing prior coding chats within the specific repo. Work on certain files or types of features may require certain somewhat predictable context from elsewhere in the repo.
One key goal is to prefer solutions which are language agnostic or
which can be easily deployed against many popular programming languages.
The `ctags` solution has this benefit, since it comes pre-built
with tooling for most popular languages.
I suspect that Language Server Protocol might be another
relevant tool to solve these "code context" problems.
## Try it out ## Try it out
To use this experimental repo map feature: To use this experimental repo map feature: