2025-05-29 08:44:59 +00:00 · 2023-05-23 05:55:07 -07:00 · 2023-05-23 05:55:07 -07:00 · b67ef10c27
commit b67ef10c27
parent b508431766
1 changed files with 46 additions and 20 deletions
--- a/docs/ctags.md
+++ b/docs/ctags.md
@@ -2,7 +2,8 @@
 # Improving GPT-4's codebase understanding with ctags
 GPT-4 is extremely useful for "self-contained" coding tasks,
-like generating brand new code or modifying a pure function without dependencies.
+like generating brand new code or modifying a pure function
 that has no dependencies.
 But it's difficult to use GPT-4 to modify or extend
 a large, complex pre-existing codebase.
@@ -37,35 +38,38 @@ class objects that are required to prepare for the test.
 ## The problem: code context
 GPT-4 is great at "self contained" coding tasks, like writing or
-modifying a pure function with no external dependencies. These work
+modifying a pure function with no external dependencies.
-well because you can send GPT a self-contained request like "write a
+GPT can easily handle requests like "write a
 Fibonacci function" or "rewrite the loop using list
-comprehensions". These changes require no context beyond the code
+comprehensions", because they require no context beyond the code
 being discussed.
 Most real code is not pure and self-contained, it is intertwined with
-code from many different files in a repo.
+and depends on code from many different files in a repo.
 If you ask GPT to "switch all the print statements in class Foo to
 use the BarLog logging system", it needs to see the code in the Foo class
-with the prints, and it also needs to understand how the project's BarLog
+with the prints, and it also needs to understand the project's BarLog
-logging system works.
+subsystem.
 A simple solution is to **send the entire codebase** to GPT along with
 each change request. Now GPT has all the context! But this won't work
 for even moderately
-sized repos that won't fit in the 8k-token context window.
+sized repos, because they won't fit into the 8k-token context window.
-A better approach is to be selective, and **hand pick which files to send**.
+A better approach is to be selective,
 and **hand pick which files to send**.
 For the example above, you could send the file that
-contains Foo and the file that contains the BarLog logging subsystem.
+contains the Foo class
-This works pretty well, and is supported by `aider`: you
+and the file that contains the BarLog logging subsystem.
 This works pretty well, and is supported by `aider` -- you
 can manually specify which files to "add to the chat".
 But it's not ideal to have to manually identify the right
-set of files to add to the chat. 
+set of files to add to the chat.
-Some changes may need context from many files.
+And sending whole files is a bulky way to send code context,
-And you might still overrun
+wasting the precious 8k context window.
-the context window if you need to add many files for context.
+You may quickly run out of context window if you need to
 send many files worth of context.
 ## Using a repo map to provide context
@@ -113,7 +117,7 @@ Mapping out the repo like this provides some benefits:
  - GPT can see variables, classes, methods and function signatures from everywhere in the repo. This alone may give it enough context to solve many tasks. For example, it can probably figure out how to use the API exported from a module just based on the details shown in the map.
  - If it needs to see more code, GPT can use the map to figure out by itself which files it needs to look at. GPT will then ask to see these specific files, and `aider` will automatically add them to the chat context (with user approval).
-Of course, for large repositories, even just their map might be too large
+Of course, for large repositories even just the map might be too large
 for the context window.  However, this mapping approach opens up the
 ability to collaborate with GPT-4 on larger codebases than previous
 methods.  It also reduces the need to manually curate which files to
@@ -149,10 +153,11 @@ For example, here is the `ctags --fields=+S --output-format=json` output for the
 }
 ```
-The repo map is built using this `ctags` data.
+The repo map is built using this type of `ctags` data,
-Rather then sending the data to GPT using verbose json, `aider`
+formatting into the space
-formats the map as a sorted,
+efficient hierarchical tree format shown above.
-hierarchical tree. This is a format that GPT can easily understand and which efficiently conveys the map data using a
+This is a format that GPT can easily understand
 and which conveys the map data using a
 minimal number of tokens.
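
As a rough illustration of the idea, the sketch below groups `ctags` JSON records by file and scope and renders them as an indented tree. It is not `aider`'s actual implementation: the function names `build_repo_map` and `render` are hypothetical, and it assumes each input line is one JSON object carrying the `_type`, `name`, `path`, `scope` and `signature` fields, with nested scopes joined by `.`.

```python
import json


def build_repo_map(ctags_json_lines):
    """Fold ctags JSON records into a nested dict: path -> scope parts -> symbol."""
    tree = {}
    for line in ctags_json_lines:
        line = line.strip()
        if not line:
            continue
        tag = json.loads(line)
        # Skip any non-tag records that may be mixed into the stream.
        if tag.get("_type") != "tag":
            continue
        node = tree.setdefault(tag["path"], {})
        scope = tag.get("scope")
        if scope:
            # Assumption: nested scopes are dot-separated, e.g. "Outer.Inner".
            for part in scope.split("."):
                node = node.setdefault(part, {})
        label = tag["name"]
        if tag.get("signature"):
            label = f"{label} {tag['signature']}"
        node.setdefault(label, {})
    return tree


def render(tree, depth=0):
    """Render the nested dict as sorted, indented lines."""
    lines = []
    for key in sorted(tree):
        suffix = ":" if depth == 0 else ""
        lines.append("   " * depth + key + suffix)
        lines.extend(render(tree[key], depth + 1))
    return lines
```

Calling `"\n".join(render(build_repo_map(lines)))` yields one line per file, scope or symbol, which drops the repeated keys and punctuation that JSON would spend tokens on.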
 ## Example chat transcript
@@ -169,6 +174,27 @@ Using only the meta-data in the map, GPT is able to figure out how to call the m
 GPT makes one reasonable mistake writing the first version of the test, but is
 able to quickly fix the issue after being shown the `pytest` error output.
 ## Future work
 Just as it was inefficient to send "the whole codebase" to GPT with
 every request, there are probably better approaches than sending
 "the whole repo map" with every request.
 Sending a subset of the repo map would help `aider` work
 better with even larger repositories which have large maps.
 Some possible approaches to reducing the amount of map data are:
  - Distill the global map further, to prioritize important symbols and discard "internal" or otherwise less globally relevant identifiers.
  - Provide a mechanism for GPT to start with a distilled subset of the global map, and let it ask to see more detail about subtrees or keywords that it feels are relevant to the current coding task.
  - Attempt to analyze the natural language coding task given by the user and predict which subset of the repo map is relevant, possibly by analyzing prior coding chats within the specific repo. Work on certain files or types of features may require certain somewhat predictable context from elsewhere in the repo.
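
The first approach could be prototyped along these lines. This is only a sketch under assumed heuristics, none of which come from `aider`: a leading underscore marks a symbol as "internal", classes and functions matter more than members and variables, and the budget is a simple symbol count. The names `distill_tags` and `KIND_PRIORITY` are hypothetical, and the `kind` values are assumed to be the long kind names from the `ctags` JSON output.

```python
# Hypothetical ranking: lower numbers are kept first.
KIND_PRIORITY = {"class": 0, "function": 1, "member": 2, "variable": 3}


def distill_tags(tags, max_symbols):
    """Drop underscore-prefixed symbols, then keep the highest-priority ones."""
    public = [t for t in tags if not t["name"].startswith("_")]
    ranked = sorted(
        public,
        key=lambda t: (KIND_PRIORITY.get(t.get("kind"), 99), t["path"], t["name"]),
    )
    return ranked[:max_symbols]
```

A symbol-count budget is a stand-in here; a real version would more likely measure the rendered map in tokens against the context window.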
 One key goal is to prefer solutions which are language agnostic or
 which can be easily deployed against many popular code languages.
 The `ctags` solution has this benefit, since it comes pre-built
 with tooling for most popular languages.
 I suspect that Language Server Protocol might be another
 relevant tool to solve these "code context" problems.
 ## Try it out
 To use this experimental repo map feature: