3849 commits

Author SHA1 Message Date
Paul Gauthier
d376767ca3 Use gpt-3.5-turbo-0125 for commit messages 2024-02-10 07:32:26 -08:00
Paul Gauthier
6ddfc894e7 Updated HISTORY 2024-02-10 07:31:04 -08:00
Paul Gauthier
bdef4308fe Simpler calls to pypandoc 2024-02-08 16:14:17 -08:00
Paul Gauthier
efff174f9a Use download_pandoc, which works everywhere including arm64 2024-02-08 15:56:00 -08:00
Paul Gauthier
2dee76378b keep hrefs 2024-02-08 15:19:00 -08:00
Paul Gauthier
62c67a1e23 copy 2024-02-08 14:41:51 -08:00
Paul Gauthier
115bf10978 Added /web to commands doc 2024-02-08 14:39:18 -08:00
Paul Gauthier
50b6a00331 bs4 -> beautifulsoup4 2024-02-08 14:29:56 -08:00
Paul Gauthier
05c74202ec copy 2024-02-08 14:20:49 -08:00
Paul Gauthier
cc36329691 output with pandoc, cleanup with bs and re 2024-02-08 14:14:42 -08:00
Paul Gauthier
be60b785a4 aider: Remove all data: URLs from slimdown_html.
# Aider chat conversation:

USER: in `slimdown_html` remove all data: urls

ASSISTANT: ```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 def slimdown_html(soup):
     # Remove all <img> tags
     for img in soup.find_all('img'):
         img.decompose()
+    # Remove all elements with data: URLs
+    for tag in soup.find_all(href=lambda x: x and x.startswith('data:')):
+        tag.decompose()
+    for tag in soup.find_all(src=lambda x: x and x.startswith('data:')):
+        tag.decompose()
     # Remove all per-element CSS styles
     for tag in soup.find_all(True):
         tag.attrs.pop('style', None)
     # Remove all internal anchor elements
     for anchor in soup.find_all('a', href=True):
         if anchor['href'].startswith('#'):
             anchor.decompose()
     return soup
```
2024-02-08 12:12:03 -08:00
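The cleanup steps in this commit can be tried on their own with a minimal sketch like the one below. It assumes `beautifulsoup4` is installed. The `is_data_url` helper is a hypothetical name introduced here for readability and is not part of `aider/scrape.py`.

```python
from bs4 import BeautifulSoup


def is_data_url(value):
    """True for attribute values such as 'data:image/png;base64,...'."""
    return bool(value) and value.startswith("data:")


def slimdown_html(soup):
    """Strip images, data: URLs, inline styles and in-page anchors, in place."""
    # Remove all <img> tags
    for img in soup.find_all("img"):
        img.decompose()
    # Remove every element whose href or src is a data: URL
    for attr in ("href", "src"):
        for tag in soup.find_all(attrs={attr: is_data_url}):
            tag.decompose()
    # Remove all per-element CSS styles
    for tag in soup.find_all(True):
        tag.attrs.pop("style", None)
    # Remove anchors that only point within the page
    for anchor in soup.find_all("a", href=True):
        if anchor["href"].startswith("#"):
            anchor.decompose()
    return soup
```

Note that `decompose()` drops an element together with its text, so the label of an internal anchor disappears along with the link.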
Paul Gauthier
84ae51c005 Add functionality to write scraped content to a temporary file and optimize HTML parsing. 2024-02-08 12:11:59 -08:00
Paul Gauthier
33a62c8dc6 aider: finish slimdown_html. remove all <img> tags and all per-element css-styles and all internal anchor elements
# Aider chat conversation:

USER: finish `slimdown_html`. remove all <img> tags and all per-element css-styles and all internal anchor elements

ASSISTANT: ```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
-def slimdown_html(page_source: str) -> str:
-    soup = BeautifulSoup(page_source, "html.parser")
-    # ...
+def slimdown_html(page_source: str) -> str:
+    soup = BeautifulSoup(page_source, "html.parser")
+    # Remove all <img> tags
+    for img in soup.find_all('img'):
+        img.decompose()
+    # Remove all per-element CSS styles
+    for tag in soup.find_all(True):
+        tag.attrs.pop('style', None)
+    # Remove all internal anchor elements
+    for anchor in soup.find_all('a', href=True):
+        if anchor['href'].startswith('#'):
+            anchor.decompose()
+    return str(soup)
```
2024-02-08 12:10:12 -08:00
Paul Gauthier
1520bb976a Change html_to_text to html_to_markdown and add slimdown_html function. 2024-02-08 12:10:10 -08:00
Paul Gauthier
1509e21698 aider: Implemented html_to_markdown using pypandoc.
# Aider chat conversation:

USER: use pypandoc to implement `html_to_markdown`

ASSISTANT: To implement `html_to_markdown` using `pypandoc`, I will:

1. Add an import for `pypandoc`.
2. Implement the `html_to_markdown` function using `pypandoc.convert_text`.

Here are the diffs for those changes:

```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 import sys
+import pypandoc
 import httpx
 from bs4 import BeautifulSoup
 from playwright.sync_api import sync_playwright
@@ ... @@
 def html_to_markdown(page_source: str) -> str:
-    pass
+    return pypandoc.convert_text(page_source, 'md', format='html')
```

Please ensure that `pypandoc` is installed in your environment to use this implementation. If it's not installed, you can install it using `pip install pypandoc`.
2024-02-08 12:02:45 -08:00
Paul Gauthier
75e7f2bd4e Add html_to_markdown function to convert HTML to Markdown. 2024-02-08 12:02:42 -08:00
Paul Gauthier
681f26d010 Print playwright instructions after the content is displayed, so they are not lost 2024-02-08 12:01:18 -08:00
Paul Gauthier
e42005a78e plumbed through to /web 2024-02-08 11:47:28 -08:00
Paul Gauthier
5b78d929a6 Roughly working scraper 2024-02-08 11:44:55 -08:00
Paul Gauthier
9bf3a6e0c6 aider: Implemented HTTP request using httpx library in scrape_with_httpx method.
# Aider chat conversation:

USER: finish `scrape_with_httpx`

ASSISTANT: To finish the `scrape_with_httpx` method, I will implement the HTTP request using the `httpx` library. This will involve making a GET request to the provided URL and returning the content of the response.

Here is the diff for the changes:

```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
     def scrape_with_httpx(self, url):
-        pass
+        import httpx
+        headers = {
+            'User-Agent': aider_user_agent
+        }
+        try:
+            with httpx.Client(headers=headers) as client:
+                response = client.get(url)
+                response.raise_for_status()
+                return response.text
+        except httpx.HTTPError as http_err:
+            self.print_error(f'HTTP error occurred: {http_err}')
+        except Exception as err:
+            self.print_error(f'An error occurred: {err}')
+        return None
```

This implementation uses `httpx.Client` to send a GET request with a custom User-Agent header. It handles any HTTP errors by printing them out and returns `None` in case of an error. If the request is successful, it returns the text content of the response.
2024-02-08 11:29:46 -08:00
Paul Gauthier
e04187d1ad Refactor scraper to use Playwright for web scraping and handle user agent string. 2024-02-08 11:29:44 -08:00
Paul Gauthier
f10a9d2e7b aider: Retrieve the user agent from the browser context after creating a new page in the scrape_with_playwright function.
# Aider chat conversation:

USER: determine the user_agent

ASSISTANT: To determine the `user_agent` in the `scrape_with_playwright` function, we can use the `browser.new_context()` method which allows us to specify a `user_agent` string. We will set a common user agent string for a desktop browser.

Here is the diff for the changes:

```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 def scrape_with_playwright(url):
     with sync_playwright()
```
^C KeyboardInterrupt

USER: I don't want to SET it. I want to determine what it is

ASSISTANT: To determine the `user_agent` being used by the browser in the `scrape_with_playwright` function, we can retrieve it from the browser context after creating a new page. Here is the diff for the changes:

```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 def scrape_with_playwright(url):
     with sync_playwright() as p:
         try:
             browser = p.chromium.launch()
         except Exception as e:
             print(repr(e))
             return
+        page = browser.new_page()
+        user_agent = page.evaluate("navigator.userAgent")
+        print(f"User Agent: {user_agent}")
-        page = browser.new_page()
         page.goto(url)
         content = page.content()
         browser.close()

     return content
```
2024-02-08 11:02:54 -08:00
Paul Gauthier
65111591d7 Add scraper module for data extraction. 2024-02-08 11:02:52 -08:00
Paul Gauthier
e2acc028d0 aider: Set the user agent to 'Aider v0.24.0-dev'.
# Aider chat conversation:

USER: set the user agent to 'Aider v0.24.0-dev'

ASSISTANT: ```diff
--- playw.py
+++ playw.py
@@ ... @@
 def main(url):
     # Check if Chromium is installed, if not, install it
     with sync_playwright() as p:
-        p.chromium.launch()
+        browser = p.chromium.launch(user_agent='Aider v0.24.0-dev')
+        page = browser.new_page()
+        page.goto(url)
+        content = page.content()
+        print(content)
+        browser.close()

 if __name__ == "__main__":
     if len(sys.argv) < 2:
```

USER: UnifiedDiffNoMatch: hunk failed to apply!

playw.py does not contain lines that match the diff you provided!
Try again.
DO NOT skip blank lines, comments, docstrings, etc!
The diff needs to apply cleanly to the lines in playw.py!

playw.py does not contain these 7 exact lines in a row:
```
def main(url):
    # Check if Chromium is installed, if not, install it
    with sync_playwright() as p:
        p.chromium.launch()

if __name__ == "__main__":
    if len(sys.argv) < 2:
```

ASSISTANT: I apologize for the oversight. Here is the corrected diff:

```diff
--- playw.py
+++ playw.py
@@ ... @@
     with sync_playwright() as p:
-        browser = p.chromium.launch()
+        browser = p.chromium.launch(user_agent='Aider v0.24.0-dev')
         page = browser.new_page()
         page.goto(url)
         #page.wait_for_load_state('networkidle')
         content = page.content()
         print(content)
         browser.close()
```
2024-02-08 07:59:12 -08:00
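A caveat on the diff above: to my knowledge, Playwright's `launch()` does not take a `user_agent` argument. The user agent is an option of the browser context (or of `browser.new_page()`). The sketch below shows that variant. `build_user_agent` and `fetch_html` are hypothetical names, not functions from `playw.py`.

```python
def build_user_agent(version):
    """Format the user agent string used in the commit, e.g. 'Aider v0.24.0-dev'."""
    return f"Aider v{version}"


def fetch_html(url, user_agent):
    """Load `url` in headless Chromium with a custom user agent and return the HTML."""
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            # The user agent is a browser-context option, not a launch option.
            context = browser.new_context(user_agent=user_agent)
            page = context.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()
```

A call such as `fetch_html(url, build_user_agent("0.24.0-dev"))` would then send the string requested in the chat. It needs the Chromium binary that `playwright install chromium` provides.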
Paul Gauthier
5a06508e0e Install Chromium if not already installed before launching it. 2024-02-08 07:58:59 -08:00
Paul Gauthier
09de0f5d43 aider: Add check to see if Playwright needs to install Chromium and its dependencies.
# Aider chat conversation:

USER: include a check to see if playwright needs to install chromium and its deps

ASSISTANT: Here is the updated `playw.py` file with the check to see if Playwright needs to install Chromium and its dependencies:

```diff
--- playw.py
+++ playw.py
@@ ... @@
 import sys
+from playwright.__main__ import main as playwright_install

 def main(url):
+    # Check if Chromium is installed, if not, install it
+    try:
+        with sync_playwright() as p:
+            p.chromium.launch()
+    except Exception as e:
+        print("Chromium is not installed. Installing necessary dependencies...")
+        playwright_install(['install', 'chromium'])
+
     with sync_playwright() as p:
         browser = p.chromium.launch()
         page = browser.new_page()
         page.goto(url)
         content = page.content()
         print(content)
         browser.close()

 if __name__ == "__main__":
     if len(sys.argv) < 2:
         print("Usage: python playw.py <URL>")
         sys.exit(1)
     main(sys.argv[1])
```

This diff adds a check at the start of `main` to see if Chromium can be launched. If an exception occurs, it assumes Chromium is not installed and proceeds to install it using Playwright's command-line interface.
2024-02-08 07:24:46 -08:00
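The same launch-then-install check can be sketched by invoking the Playwright CLI as a subprocess. This avoids relying on how `playwright.__main__.main` handles an argument list, which I have not verified. `chromium_install_command` and `ensure_chromium` are hypothetical helper names.

```python
import subprocess
import sys


def chromium_install_command():
    """Command line equivalent to running `playwright install chromium`."""
    return [sys.executable, "-m", "playwright", "install", "chromium"]


def ensure_chromium():
    """Try to launch Chromium; if that fails, install it via the Playwright CLI."""
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    try:
        with sync_playwright() as p:
            p.chromium.launch().close()
        return
    except Exception as err:
        # A failed launch is treated as "not installed", as in the diff above,
        # although other causes (for example missing system libraries) are possible.
        print(f"Could not launch Chromium ({err!r}); installing it...")
    subprocess.run(chromium_install_command(), check=True)
```

Using `sys.executable` runs the install with the same interpreter, and therefore the same Playwright version, as the calling script.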
Paul Gauthier
d1782d1c87 aider: Remove asynchronous code and use synchronous Playwright API.
# Aider chat conversation:

USER: don't use async

ASSISTANT: Here is the updated `playw.py` file without using async:

```diff
--- playw.py
+++ playw.py
@@ ... @@
-import asyncio
-from playwright.async_api import async_playwright
+from playwright.sync_api import sync_playwright
 import sys

-async def main(url):
-    async with async_playwright() as p:
-        browser = await p.chromium.launch()
-        page = await browser.new_page()
-        await page.goto(url)
-        content = await page.content()
-        print(content)
-        await browser.close()
+def main(url):
+    with sync_playwright() as p:
+        browser = p.chromium.launch()
+        page = browser.new_page()
+        page.goto(url)
+        content = page.content()
+        print(content)
+        browser.close()

 if __name__ == "__main__":
     if len(sys.argv) < 2:
         print("Usage: python playw.py <URL>")
         sys.exit(1)
-    asyncio.run(main(sys.argv[1]))
+    main(sys.argv[1])
```

This diff removes the asynchronous code and replaces it with the synchronous version provided by Playwright's `sync_api`. The `async` and `await` keywords are removed, and the `async_playwright` context manager is replaced with `sync_playwright`. The script should now run synchronously.
2024-02-08 07:21:23 -08:00
Paul Gauthier
a2649cabe9 aider: Added Playwright script to fetch text from a supplied URL.
# Aider chat conversation:

USER: write a playwright script to fetch the text from a url supplied on the cmd line

ASSISTANT: To create a Playwright script that fetches content from a URL supplied on the command line, I will write the necessary code in the `playw.py` file. The script will use Playwright's Python package to launch a browser, navigate to the provided URL, and print the page's HTML content to the console.

Here is the diff for the new `playw.py` file:

```diff
--- /dev/null
+++ playw.py
@@ ... @@
+import asyncio
+from playwright.async_api import async_playwright
+import sys
+
+async def main(url):
+    async with async_playwright() as p:
+        browser = await p.chromium.launch()
+        page = await browser.new_page()
+        await page.goto(url)
+        content = await page.content()
+        print(content)
+        await browser.close()
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print("Usage: python playw.py <URL>")
+        sys.exit(1)
+    asyncio.run(main(sys.argv[1]))
```

This script will require the Playwright package to be installed. You can install it using `pip install playwright` and then run `playwright install` to install the necessary browser binaries. To run the script, use the command `python playw.py <URL>`, replacing `<URL>` with the actual URL you want to fetch text from.
2024-02-08 07:20:50 -08:00
Paul Gauthier
1758937042 wip 2024-02-08 07:18:39 -08:00
Paul Gauthier
61531b9430 aider: Add chromedriver-autoinstaller to requirements.in.
# Aider chat conversation:

USER: I added these *read-write* files: requirements.in

ASSISTANT: It seems there might be a misunderstanding. You've indicated that you've added the `requirements.in` file, but you haven't specified any changes that need to be made to it or any other files. If you need any modifications or have any specific requests regarding the `requirements.in` or any other file, please let me know, and I'll be happy to assist you with the necessary changes.

USER: add chromedriver-autoinstaller to .in not .txt

ASSISTANT: Here is the diff to add `chromedriver-autoinstaller` to the `requirements.in` file:

```diff
--- requirements.in
+++ requirements.in
@@ ... @@
 webdriver_manager
 selenium
+chromedriver-autoinstaller
```

This change adds `chromedriver-autoinstaller` to the `requirements.in` file, which is used to generate the `requirements.txt` file with `pip-compile`. After adding this line, you would run `pip-compile requirements.in` to update the `requirements.txt` file accordingly.
2024-02-07 16:30:30 -08:00
Paul Gauthier
99be34fb61 aider: Add chromedriver-autoinstaller to requirements.txt for automatic installation of ChromeDriver.
# Aider chat conversation:

USER: does the user need to install anything beyond the pip requirements to have selenium drive chrome?

ASSISTANT: To have Selenium drive Chrome, the user needs to have the Chrome browser installed on their system and also the ChromeDriver executable, which allows Selenium to interface with Chrome. The ChromeDriver version must be compatible with the version of Chrome installed on the system.

The `requirements.txt` file already includes `selenium`, but it does not include `chromedriver-autoinstaller`, which can automatically download and install the correct version of ChromeDriver for the user's version of Chrome. To facilitate this, we can add `chromedriver-autoinstaller` to the `requirements.txt` file.

Here is the diff to add `chromedriver-autoinstaller` to the `requirements.txt`:

```diff
--- requirements.txt
+++ requirements.txt
@@ ... @@
+selenium==4.17.2
+chromedriver-autoinstaller==0.3.1
    # via -r requirements.in
```

By adding `chromedriver-autoinstaller`, the user can ensure that the correct version of ChromeDriver is installed and updated as needed, without manual intervention.
2024-02-07 16:29:45 -08:00
Paul Gauthier
ffdd034b7b Add new dependencies and update existing ones. 2024-02-07 16:29:44 -08:00
Paul Gauthier
29fa63cd3b aider: Add bs4 to requirements.
# Aider chat conversation:

USER: add bs4

ASSISTANT: ```diff
--- requirements.in
+++ requirements.in
@@ ... @@
 soundfile
+bs4
 PyYAML
 Pillow
```
2024-02-07 16:13:09 -08:00
Paul Gauthier
aa54fdd5f7 aider: Add the missing requirements for selenium and related packages.
# Aider chat conversation:

USER: add the missing requirements for selenium and related packages

ASSISTANT: ```diff
--- requirements.in
+++ requirements.in
@@ ... @@
+webdriver_manager
+selenium
```

The above diff adds the missing requirements for Selenium and the WebDriver Manager to the `requirements.in` file.
2024-02-07 14:23:24 -08:00
Paul Gauthier
ef12107bd7 aider: Add command /web to use headless selenium to scrape a webpage and add the content to the chat.
# Aider chat conversation:

USER: Add a command /web to use headless selenium to scrape a webpage and add the content to the chat

ASSISTANT: To implement the `/web` command, we will need to add a new method to the `Commands` class in the `aider/commands.py` file. This method will use Selenium with a headless browser to scrape the content of a given webpage and add the content to the chat.

Here are the diffs for the changes:

```diff
--- aider/commands.py
+++ aider/commands.py
@@ ... @@
 from pathlib import Path
+from selenium import webdriver
+from selenium.webdriver.chrome.options import Options
+from selenium.webdriver.chrome.service import Service
+from selenium.webdriver.common.by import By
+from webdriver_manager.chrome import ChromeDriverManager
 import re
 import subprocess
@@ ... @@
 class Commands:
     voice = None
+    web_driver = None

     def __init__(self, io, coder, voice_language=None):
         self.io = io
@@ ... @@
         self.voice_language = voice_language
         self.tokenizer = coder.main_model.tokenizer
+        self.initialize_web_driver()

+    def initialize_web_driver(self):
+        chrome_options = Options()
+        chrome_options.add_argument("--headless")
+        chrome_options.add_argument("--disable-gpu")
+        chrome_options.add_argument("--no-sandbox")
+        chrome_options.add_argument("--disable-dev-shm-usage")
+        self.web_driver = webdriver.Chrome(
+            service=Service(ChromeDriverManager().install()),
+            options=chrome_options
+        )
+
+    def cmd_web(self, args):
+        "Use headless selenium to scrape a webpage and add the content to the chat"
+        url = args.strip()
+        if not url:
+            self.io.tool_error("Please provide a URL to scrape.")
+            return
+
+        try:
+            self.web_driver.get(url)
+            page_content = self.web_driver.find_element(By.TAG_NAME, "body").text
+            self.io.tool_output(f"Content from {url}:\n{page_content}")
+            return page_content
+        except Exception as e:
+            self.io.tool_error(f"Error scraping {url}: {e}")
+
     def is_command(self, inp):
         return inp[0] in "/!"
@@ ... @@
```

This diff adds the necessary imports for Selenium and the WebDriver manager, initializes a headless Chrome WebDriver when the `Commands` class is instantiated, and adds a new command method `cmd_web` that takes a URL as an argument, scrapes the webpage, and outputs the content to the chat.
2024-02-07 14:22:02 -08:00
Paul Gauthier
8d9c2ba60b Added links to openai docs/pricing 2024-02-07 14:17:24 -08:00
Paul Gauthier
49f81ac784 ssh 2024-02-07 12:13:32 -08:00
Paul Gauthier
35f812859e Rationalized openai model attributes #458 2024-02-07 12:07:30 -08:00
Paul Gauthier
873a388693 Fixed the ! shortcut for /run 2024-02-06 16:14:27 -08:00
Paul Gauthier
926e393529 set version to 0.23.1-dev 2024-02-03 08:40:50 -08:00
Paul Gauthier
b0fd819241 version bump to 0.23.0 2024-02-03 08:40:17 -08:00
Paul Gauthier
174702babe Updated gpt-4 turbo model switches 2024-02-03 08:39:19 -08:00
Paul Gauthier
bca265abfc Updated HISTORY 2024-02-03 08:31:28 -08:00
Paul Gauthier
e2e2ee83a4 Added support for gpt-4-0125-preview 2024-02-03 08:27:31 -08:00
Paul Gauthier
f5887a5098 tweaking graph labels 2024-02-03 08:25:19 -08:00
Paul Gauthier
da14474cbe Updated HISTORY 2024-01-31 17:55:22 -08:00
Paul Gauthier
1317ee1ace Added --openrouter 2024-01-31 17:33:50 -08:00
Paul Gauthier
e79732dcf7 Added env_vars for all the OPENAI_ env vars 2024-01-31 15:26:53 -08:00
Paul Gauthier
885e33c2b3 copy 2024-01-26 13:51:03 -08:00
Paul Gauthier
edcf9b146b updated graph 2024-01-25 15:33:52 -08:00