Commit graph

247 commits

Author SHA1 Message Date
Paul Gauthier
1c73e7d43a turn off suggest shell commands during benchmarks 2024-09-05 14:35:34 -07:00
Paul Gauthier
05dcbeecac noop 2024-09-05 14:25:09 -07:00
Paul Gauthier
ff3a75413b sonnet+deep got 60.9/82.0 2024-09-05 13:30:25 -07:00
Paul Gauthier
1a3d8c4015 wip 2024-08-20 17:45:40 -07:00
Paul Gauthier
b61b5f4b74 cleanup before merge 2024-08-16 11:35:30 -07:00
Paul Gauthier
bac04a2a3d no lint 2024-08-15 06:10:46 -07:00
Paul Gauthier
060c8ff89a override dotenv 2024-08-13 18:06:00 -07:00
Paul Gauthier
139f7992cb do not pass pretty to coder 2024-08-13 17:43:41 -07:00
Paul Gauthier
ca18220b77 num_with_malformed_responses 2024-05-19 14:19:06 -07:00
Paul Gauthier
70b1c0c20c load .env in benchmark.py 2024-05-07 13:32:19 -07:00
Paul Gauthier
ecca737803 added deepseek-chat v2 2024-05-07 06:26:39 -07:00
Paul Gauthier
b1cae73a85 cleaned up csv output 2024-05-07 05:59:31 -07:00
Paul Gauthier
a7b08c7354 format output as yaml 2024-05-06 11:15:19 -07:00
Paul Gauthier
3162d42262 cleanup 2024-05-06 10:46:09 -07:00
Paul Gauthier
5fb7a323ec refactored plots 2024-05-06 10:44:34 -07:00
Paul Gauthier
3bb237bdc1 handle tasks with exceptions in the stats output 2024-05-05 08:24:45 -07:00
Paul Gauthier
9cdd9e12c3 catch all exceptions in the benchmark 2024-05-04 17:52:46 -07:00
Paul Gauthier
9b88f8caf6 updated gpt-4-0314 2024-05-04 07:59:27 -07:00
Paul Gauthier
f6580fff76 updated all openai models 2024-05-04 07:38:50 -07:00
Paul Gauthier
1981105932 aider: Implemented the TODO to extract the version from aider/__init__.py for each commit hash.
# Aider chat conversation:

USER: do the todo

ASSISTANT: Here is the *SEARCH/REPLACE block* to do the TODO in the code:

benchmark/benchmark.py
<source>python
<<<<<<< SEARCH
    for hsh in variants['commit_hash']):
        # TODO: get the output from `git show {hsh}:aider/__init__.py`
=======
    for hsh in variants['commit_hash']:
        try:
            version = subprocess.check_output(
                ['git', 'show', f'{hsh}:aider/__init__.py'],
                universal_newlines=True
            )
            version = re.search(r'__version__ = "(.*)"', version).group(1)
            csv.append(version)
        except subprocess.CalledProcessError:
            csv.append('unknown')
>>>>>>> REPLACE
</source>

This change:
1. Loops through each commit hash in `variants['commit_hash']`
2. Uses `subprocess.check_output` to run the `git show` command and capture the contents of `aider/__init__.py` at that commit
3. Searches the file contents for the `__version__ = "X.Y.Z"` line and extracts the version string
4. Appends the version to the `csv` list, or `'unknown'` if there was an error getting the file contents

The `try/except` handles cases where the commit hash doesn't exist or `aider/__init__.py` is missing.
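For reference, a minimal self-contained sketch of the same approach, runnable from inside the aider repo (the example hashes are just two taken from this log):

<source>python
# Sketch: extract aider's __version__ as of a set of commit hashes.
# `hashes` stands in for variants['commit_hash'] in the benchmark script.
import re
import subprocess

hashes = ["9b88f8caf6", "f6580fff76"]  # example hashes from this log

versions = []
for hsh in hashes:
    try:
        # Contents of aider/__init__.py as it existed at that commit
        content = subprocess.check_output(
            ["git", "show", f"{hsh}:aider/__init__.py"],
            universal_newlines=True,
        )
        match = re.search(r'__version__ = "(.*)"', content)
        versions.append(match.group(1) if match else "unknown")
    except subprocess.CalledProcessError:
        # Unknown hash or missing file at that commit
        versions.append("unknown")

print(versions)
</source>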
2024-05-04 07:14:23 -07:00
Paul Gauthier
01282674d4 Add pass rates to CSV output in benchmark results summary. 2024-05-04 07:13:40 -07:00
Paul Gauthier
4461c7c4b2 fixed benchmark 2024-04-23 09:44:04 -07:00
Paul Gauthier
fd5b9bbfcb Added groq llama3 2024-04-22 07:12:01 -07:00
Paul Gauthier
434fa5f6a7 updated benchmark to new Coder & Model classes 2024-04-19 15:21:24 -07:00
Paul Gauthier
7875418183 fix column order 2024-04-09 18:11:08 -07:00
Paul Gauthier
00f1cdb561 Added gpt-4-turbo vision blog post 2024-04-09 16:55:35 -07:00
Paul Gauthier
ac39791fee fixed mislabelled gpt-4 column 2024-03-09 08:20:27 -08:00
Paul Gauthier
f5887a5098 tweaking graph labels 2024-02-03 08:25:19 -08:00
Paul Gauthier
9033be74bf Initial benchmark results for 0125 2024-01-25 13:00:16 -08:00
Joshua Vial
93f32d3855 make benchmark listen to openai_api_base env var 2023-12-21 09:38:54 +13:00
Joshua Vial
9e656945fe Merge remote-tracking branch 'upstream/main' into gpt4-vision 2023-12-21 09:29:32 +13:00
Joshua Vial
d4e663f7bc benchmark work with openrouter 2023-12-20 10:27:33 +13:00
Paul Gauthier
755b3858eb copy 2023-12-19 11:11:58 -08:00
Paul Gauthier
e3c8fac604 copy 2023-12-18 10:20:40 -08:00
Paul Gauthier
b0c03820e9 copy 2023-12-18 10:19:38 -08:00
Paul Gauthier
16534e914b better graph 2023-12-18 10:02:52 -08:00
Paul Gauthier
6ab2db192c Added udiff graph 2023-12-18 09:53:28 -08:00
Paul Gauthier
7113a30271 unified diffs 2023-12-17 12:54:34 -08:00
Paul Gauthier
cab7460f94 catch 404s from azure on models.list 2023-12-07 07:44:21 -08:00
Paul Gauthier
bf03f43b44 fixed benchmark 2023-12-05 11:21:11 -08:00
Paul Gauthier
6ebc142377 roughed in openai 1.x 2023-12-05 07:37:05 -08:00
Paul Gauthier
1d0bc3dcb6 Updated benchmark reports 2023-11-14 16:03:50 -08:00
Paul Gauthier
8f8a32f325 aider: Fix NameError in show_stats function by passing graphs as a parameter.
# Aider chat conversation:

USER: I ran this command:

 ./benchmark/benchmark.py tmp.benchmarks/2023-11-07-02-37-08--gpt-3.5-turbo-1106 --stats

And got this output:

─────────────────────────────── tmp.benchmarks/2023-11-07-02-37-08--gpt-3.5-turbo-1106 ───────────────────────────────
test-cases: 133
model: gpt-3.5-turbo-1106
edit_format: whole
commit_hash: f658a65-dirty
num_error_outputs: 2
num_user_asks: 2
num_exhausted_context_windows 0
test_timeouts: 2

41.4% correct after try 0
55.6% correct after try 1

duration: 3.1 sec/test-case
costs: $0.0037/test-case, $0.49 total, $0.49 projected
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Traceback (most recent call last):
  File "/Users/gauthier/Projects/aider/./benchmark/benchmark.py", line 772, in <module>
    app()
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/main.py", line 328, in __call__
    raise e
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/core.py", line 716, in main
    return _main(
           ^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/.venv/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/./benchmark/benchmark.py", line 353, in main
    return show_stats(updated_dirnames)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gauthier/Projects/aider/./benchmark/benchmark.py", line 124, in show_stats
    if graphs:
       ^^^^^^
NameError: name 'graphs' is not defined

ASSISTANT: It appears that the `graphs` variable is being used in the `show_stats` function without being defined or passed as an argument to the function. To fix this issue, we need to modify the `show_stats` function to accept `graphs` as a parameter and then pass the `graphs` variable when the function is called from the `main` function.

Here are the steps to fix the issue:

1. Modify the `show_stats` function definition to accept a new parameter `graphs`.
2. Update the call to `show_stats` in the `main` function to pass the `graphs` variable.

Let's make these changes in the `benchmark/benchmark.py` file.

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
def show_stats(dirnames):
=======
def show_stats(dirnames, graphs):
>>>>>>> REPLACE
</source>

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
return show_stats(updated_dirnames)
=======
return show_stats(updated_dirnames, graphs)
>>>>>>> REPLACE
</source>

Please apply these changes to the `benchmark/benchmark.py` file to resolve the `NameError`.
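A simplified sketch of the corrected call chain: `graphs` is local to `main()`, so `show_stats()` has to receive it as a parameter rather than reference it directly (function bodies reduced to placeholders):

<source>python
# Sketch of the fix: pass the flag down instead of relying on a
# name that only exists in main()'s scope.
def show_stats(dirnames, graphs):
    if graphs:
        print("would plot graphs for", dirnames)


def main(dirnames, graphs=False):
    return show_stats(dirnames, graphs)


main(["tmp.benchmarks/example-dir"], graphs=True)
</source>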
2023-11-08 14:15:08 -08:00
Paul Gauthier
9351f438ac aider: Add a cmd line switch --graphs and only draw them if it's provided.
# Aider chat conversation:

USER: add a cmd line switch --graphs and only draw them if it's provided

ASSISTANT: To add a command-line switch `--graphs` and conditionally generate graphs based on its presence, we need to modify the `benchmark.py` file. Specifically, we will add a new option to the `main` function to accept the `--graphs` switch and then use this option to control the calls to `plot_timing` and `plot_outcomes` functions.

Here are the steps we will follow:

1. Add the `--graphs` option to the `typer` command-line interface in the `main` function.
2. Use the value of this option to conditionally call the `plot_timing` and `plot_outcomes` functions.

Let's make these changes:

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
@app.command()
def main(
    dirnames: List[str] = typer.Argument(..., help="Directory names"),
=======
@app.command()
def main(
    dirnames: List[str] = typer.Argument(..., help="Directory names"),
    graphs: bool = typer.Option(False, "--graphs", help="Generate graphs"),
>>>>>>> REPLACE

...

<<<<<<< SEARCH
    plot_timing(df)
    plot_outcomes(df, repeats, repeat_hi, repeat_lo, repeat_avg)
=======
    if graphs:
        plot_timing(df)
        plot_outcomes(df, repeats, repeat_hi, repeat_lo, repeat_avg)
>>>>>>> REPLACE
</source>

With these changes, the `--graphs` switch will be available when running the `benchmark.py` script, and the graph generation will only occur if this switch is provided.
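A self-contained sketch of the same pattern in a hypothetical standalone typer script (not the full benchmark CLI):

<source>python
# Sketch: a boolean --graphs flag, off by default, wired into a typer command.
from typing import List

import typer

app = typer.Typer()


@app.command()
def main(
    dirnames: List[str] = typer.Argument(..., help="Directory names"),
    graphs: bool = typer.Option(False, "--graphs", help="Generate graphs"),
):
    if graphs:
        typer.echo(f"would plot graphs for {dirnames}")
    else:
        typer.echo("skipping graphs")


if __name__ == "__main__":
    app()
</source>

Run as `python script.py dirA dirB` to skip the plots, or add `--graphs` to generate them.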
2023-11-08 14:13:59 -08:00
Paul Gauthier
ed0dc95a10 copy 2023-11-08 11:16:56 -08:00
Paul Gauthier
f2a3e489e1 labels 2023-11-08 10:26:22 -08:00
Paul Gauthier
02b604adc3 Add labels to the legend in the speed graph 2023-11-08 10:20:28 -08:00
Paul Gauthier
867a0cc33f aider: Add labels to the bars in the plot_timing function to display them correctly in the legend.
# Aider chat conversation:

ASSISTANT: To address the issue with the empty legend for the edit format in the `plot_timing` function, we need to add labels to the bars so that the legend can display them correctly. We will modify the `benchmark/benchmark.py` file to include labels for each edit format.

Here's the change that needs to be made:

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
    for i, fmt in enumerate(formats):
        edge = dict(edgecolor="#ffffff", linewidth=1.5)
        color = "#b3e6a8" if "diff" in fmt else "#b3d1e6"
        hatch = "////" if "func" in fmt else ""
        rects = ax.bar(
            pos + i * width,
            grouped[fmt],
            width * 0.95,
            color=color,
            hatch=hatch,
            zorder=zorder + 1,
            **edge,
        )
=======
    # Create a mapping for the labels and colors to ensure consistency
    format_labels = {
        'diff': 'Diff',
        'func': 'Function',
        'whole': 'Whole',
        'diff-func': 'Diff-Function',
        'diff-whole': 'Diff-Whole',
        'func-whole': 'Function-Whole',
        # Add other edit formats here if necessary
    }
    format_colors = {
        'diff': '#b3e6a8',
        'func': '#b3d1e6',
        'whole': '#e6b3b3',
        'diff-func': '#e6d1b3',
        'diff-whole': '#b3e6d1',
        'func-whole': '#d1b3e6',
        # Add other edit format colors here if necessary
    }
    format_hatches = {
        'diff': '////',
        'func': '....',
        'whole': '',
        'diff-func': '\\\\\\\\',
        'diff-whole': 'xxxx',
        'func-whole': '----',
        # Add other edit format hatches here if necessary
    }

    for i, fmt in enumerate(formats):
        label = format_labels.get(fmt, fmt)
        color = format_colors.get(fmt, '#b3d1e6')
        hatch = format_hatches.get(fmt, '')
        edge = dict(edgecolor="#ffffff", linewidth=1.5)
        rects = ax.bar(
            pos + i * width,
            grouped[fmt],
            width * 0.95,
            label=label,
            color=color,
            hatch=hatch,
            zorder=zorder + 1,
            **edge,
        )
>>>>>>> REPLACE
</source>

This change adds a mapping for labels, colors, and hatches for each edit format. It also ensures that each bar in the bar chart has a label associated with it, which will be used to generate the legend. The colors and hatches are placeholders and should be adjusted to match the desired styling.
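The underlying matplotlib mechanism, as a minimal sketch with made-up numbers: each `ax.bar()` call needs a `label=` for `ax.legend()` to have anything to show.

<source>python
# Sketch: grouped bars with labels so the legend is not empty.
import matplotlib.pyplot as plt
import numpy as np

models = ["gpt-3.5-turbo", "gpt-4"]
data = {"diff": [52, 65], "whole": [58, 62]}  # made-up values

fig, ax = plt.subplots()
pos = np.arange(len(models))
width = 0.8 / len(data)

for i, (fmt, values) in enumerate(data.items()):
    # Without label=..., ax.legend() would produce an empty legend.
    ax.bar(pos + i * width, values, width * 0.95, label=fmt)

ax.set_xticks(pos + 0.5 * width)
ax.set_xticklabels(models)
ax.legend(title="Edit format")
plt.show()
</source>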
2023-11-08 10:11:47 -08:00
Paul Gauthier
eed4864af1 Changed the label of the y-axis in the plot to "Average GPT response time per exercise (sec)". 2023-11-08 10:11:46 -08:00
Paul Gauthier
207bbbc4df aider: Implemented the plot_timing function to create a bar chart showing the average duration of each (model, edit_format) combination, with consistent styling as plot_outcomes.
# Aider chat conversation:

ASSISTANT: To implement the `plot_timing` function and style the graph similarly to `plot_outcomes`, we will follow these steps:

1. Define the `plot_timing` function to create a bar chart that shows the average duration for each combination of model and edit format.
2. Use similar styling options as used in `plot_outcomes` for consistency.

Let's start by implementing the `plot_timing` function in the `benchmark/benchmark.py` file.

<source>python
benchmark/benchmark.py
<<<<<<< SEARCH
def plot_timing(df):
    """plot a graph showing the average duration of each (model, edit_format)"""
=======
def plot_timing(df):
    """plot a graph showing the average duration of each (model, edit_format)"""
    plt.rcParams["hatch.linewidth"] = 0.5
    plt.rcParams["hatch.color"] = "#444444"

    from matplotlib import rc
    rc("font", **{"family": "sans-serif", "sans-serif": ["Helvetica"], "size": 10})

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.grid(axis="y", zorder=0, lw=0.2)

    zorder = 1
    grouped = df.groupby(["model", "edit_format"])["avg_duration"].mean().unstack()
    num_models, num_formats = grouped.shape

    pos = np.array(range(num_models))
    width = 0.8 / num_formats

    formats = grouped.columns
    models = grouped.index

    for i, fmt in enumerate(formats):
        edge = dict(edgecolor="#ffffff", linewidth=1.5)
        color = "#b3e6a8" if "diff" in fmt else "#b3d1e6"
        hatch = "////" if "func" in fmt else ""
        rects = ax.bar(
            pos + i * width,
            grouped[fmt],
            width * 0.95,
            color=color,
            hatch=hatch,
            zorder=zorder + 1,
            **edge,
        )
        ax.bar_label(rects, padding=4, labels=[f"{v:.1f}s" for v in grouped[fmt]], size=6)

    ax.set_xticks([p + 0.5 * width for p in pos])
    ax.set_xticklabels(models)

    ax.set_ylabel("Average duration (seconds)")
    ax.set_title("GPT Code Editing - Timing")
    ax.legend(
        title="Edit Format",
        loc="upper left",
    )
    ax.set_ylim(top=max(grouped.max()) * 1.1)  # Set y-axis limit to 10% more than the max value

    plt.tight_layout()
    plt.savefig("tmp_timing.svg")
    imgcat(fig)
>>>>>>> REPLACE
</source>

This code block adds the implementation of the `plot_timing` function, which creates a bar chart to visualize the average duration of tests for each model and edit format. The styling is kept consistent with the `plot_outcomes` function, using similar colors, hatches, and font settings. The resulting graph is saved as an SVG file and displayed in the terminal using `imgcat`.
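A hypothetical usage sketch (the import path and timing values are assumptions, not taken from the benchmark): `plot_timing` groups by `(model, edit_format)` and averages `avg_duration`, so the DataFrame needs those three columns.

<source>python
# Hypothetical usage; values are made up for illustration.
import pandas as pd

from benchmark.benchmark import plot_timing  # assumes the aider repo root is on sys.path

df = pd.DataFrame(
    {
        "model": ["gpt-3.5-turbo", "gpt-3.5-turbo", "gpt-4", "gpt-4"],
        "edit_format": ["whole", "diff", "whole", "diff"],
        "avg_duration": [3.1, 4.2, 12.5, 14.8],
    }
)

plot_timing(df)  # writes tmp_timing.svg and displays it with imgcat
</source>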
2023-11-08 10:07:29 -08:00