Commit graph

33 commits

Author SHA1 Message Date
Paul Gauthier
644ec6f964 make test for playwright more robust #791 2024-07-03 21:37:05 -03:00
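The diff itself isn't shown in this log. A sketch of one common hardening pattern for Playwright-backed tests, assuming the flakiness came from a missing browser toolchain (the test body below is illustrative, not the actual change from #791):

```python
import pytest

# Skip the whole module cleanly if playwright isn't installed.
playwright = pytest.importorskip("playwright.sync_api")


def test_scrape_with_playwright_smoke():
    with playwright.sync_playwright() as p:
        try:
            browser = p.chromium.launch()
        except Exception:
            # Browser binaries absent (`playwright install` not run):
            # skip rather than fail.
            pytest.skip("chromium not available for playwright")
        page = browser.new_page()
        page.goto("https://example.com")
        assert page.content()
        browser.close()
```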
Paul Gauthier
ed35af44b3 defer numpy, bs4 and jsonschema 2024-07-03 13:35:33 -03:00
Paul Gauthier
2dc6735ab4 defer import of httpx 2024-07-03 13:25:10 -03:00
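Both deferrals follow the same lazy-import pattern: heavy modules are imported inside the functions that need them, so `import aider` stays fast. A minimal sketch, with the placement assumed:

```python
def scrape_with_httpx(url):
    import httpx  # deferred: only paid for when a fetch actually happens

    with httpx.Client() as client:
        return client.get(url).text
```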
paul-gauthier
fab14fcd8b Merge pull request #671 from nicolasperez19/fix-url-playwright-typo 2024-06-17 13:56:16 -07:00
Paul Gauthier
5c87008bd0 Updated HISTORY 2024-06-16 12:16:19 -07:00
Nicolas Perez
abeb9f4d84 fix: enable_playwright url had a typo 2024-06-12 02:32:50 -04:00
Paul Gauthier
dece830cb8 wait up to 5sec 2024-06-11 06:37:41 -07:00
Paul Gauthier
4ebc35e038 aider: Set a maximum timeout of 5 seconds for page.goto in the scrape_with_playwright method.
# Aider chat conversation:

USER: set a max timeout of 5sec for page.goto

ASSISTANT: To set a maximum timeout of 5 seconds for `page.goto`, we need to modify the `scrape_with_playwright` method in `aider/scrape.py`.

Here is the *SEARCH/REPLACE* block:

aider/scrape.py
```python
<<<<<<< SEARCH
            page.goto(url, wait_until="networkidle")
=======
            page.goto(url, wait_until="networkidle", timeout=5000)
>>>>>>> REPLACE
```
2024-06-11 06:36:35 -07:00
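Playwright's `timeout` is given in milliseconds, and exceeding it raises `playwright.sync_api.TimeoutError`. A minimal sketch of guarding the capped call (the error handling is an assumption, not part of this commit):

```python
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError


def goto_with_cap(page, url):
    try:
        # timeout is in milliseconds, so 5000 caps the wait at 5 seconds
        page.goto(url, wait_until="networkidle", timeout=5000)
    except PlaywrightTimeoutError:
        # The page may still be usable even if the network never idled.
        pass
```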
Paul Gauthier
337891e739 aider: Modified scrape_with_playwright method to wait for network to be idle before extracting content.
# Aider chat conversation:

USER: with playwright enabled, i get: You need to enable JavaScript to run this app.

ASSISTANT: To address the issue of JavaScript not being enabled, we need to ensure that Playwright is properly executing JavaScript on the page. This can be done by waiting for the page to load completely and ensuring that JavaScript is executed.

We will modify the `scrape_with_playwright` method to wait for the network to be idle and ensure that the page is fully loaded before extracting the content.

Here are the changes:

aider/scrape.py
```python
<<<<<<< SEARCH
            page.goto(url)
=======
            page.goto(url, wait_until="networkidle")
>>>>>>> REPLACE
```

This change will ensure that Playwright waits for the network to be idle, which typically means that all JavaScript has been executed and the page is fully loaded.
2024-06-11 06:35:00 -07:00
Paul Gauthier
a17b81236c Add content dumping functionality to the Scraper class. 2024-06-11 06:34:58 -07:00
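The diff for this commit isn't included in the log. A hypothetical shape for such a helper, with the name and signature assumed:

```python
class Scraper:
    def dump_content(self, url, content, path="scraped.html"):
        # Hypothetical debugging helper: persist the raw page so a bad
        # scrape can be inspected after the run.
        with open(path, "w", encoding="utf-8") as f:
            f.write(f"<!-- {url} -->\n")
            f.write(content)
```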
Paul Gauthier
1f1da4b1a2 refac all urls into urls.py 2024-06-08 12:53:54 -07:00
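A sketch of the shape of that refactor, with the constant names assumed; callers then reference `urls.enable_playwright`, so a typo like the one fixed in #671 needs correcting in only one place:

```python
# aider/urls.py -- gather user-facing URLs in a single module
website = "https://aider.chat/"
enable_playwright = "https://aider.chat/docs/install/optional.html#enable-playwright"
```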
Paul Gauthier
0e5342fdb8 copy 2024-06-06 11:01:27 -07:00
Paul Gauthier
b8313c5343 added docstrings 2024-05-01 15:14:14 -07:00
Paul Gauthier
dcb6100ce9 Add web page 2024-04-27 15:28:08 -07:00
Paul Gauthier
0fa2505ac5 Delete pandoc installer 2024-02-10 08:48:22 -08:00
Paul Gauthier
6ddfc894e7 Updated HISTORY 2024-02-10 07:31:04 -08:00
Paul Gauthier
bdef4308fe Simpler calls to pypandoc 2024-02-08 16:14:17 -08:00
Paul Gauthier
efff174f9a Use download_pandoc, which works everywhere including arm64 2024-02-08 15:56:00 -08:00
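`pypandoc` ships a helper that fetches a prebuilt pandoc binary, which sidesteps platform installers entirely (including on arm64). A minimal sketch of the call this commit likely leans on:

```python
from pypandoc.pandoc_download import download_pandoc

# Downloads and unpacks a pandoc binary for the current platform.
download_pandoc()
```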
Paul Gauthier
2dee76378b keep hrefs 2024-02-08 15:19:00 -08:00
Paul Gauthier
05c74202ec copy 2024-02-08 14:20:49 -08:00
Paul Gauthier
cc36329691 output with pandoc, cleanup with bs and re 2024-02-08 14:14:42 -08:00
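A sketch of that three-stage pipeline, assuming this division of labor (the actual code isn't shown in this log):

```python
import re

import pypandoc
from bs4 import BeautifulSoup


def html_to_markdown(page_source):
    # 1. cleanup with BeautifulSoup: drop script/style noise
    soup = BeautifulSoup(page_source, "html.parser")
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # 2. output with pandoc: render the cleaned HTML as markdown
    md = pypandoc.convert_text(str(soup), "markdown", format="html")
    # 3. cleanup with re: collapse runs of blank lines
    return re.sub(r"\n\s*\n+", "\n\n", md)
```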
Paul Gauthier
be60b785a4 aider: Remove all data: URLs from slimdown_html.
# Aider chat conversation:

USER: in `slimdown_html` remove all data: urls

ASSISTANT: ```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 def slimdown_html(soup):
     # Remove all <img> tags
     for img in soup.find_all('img'):
         img.decompose()
+    # Remove all elements with data: URLs
+    for tag in soup.find_all(href=lambda x: x and x.startswith('data:')):
+        tag.decompose()
+    for tag in soup.find_all(src=lambda x: x and x.startswith('data:')):
+        tag.decompose()
     # Remove all per-element CSS styles
     for tag in soup.find_all(True):
         tag.attrs.pop('style', None)
     # Remove all internal anchor elements
     for anchor in soup.find_all('a', href=True):
         if anchor['href'].startswith('#'):
             anchor.decompose()
     return soup
```
2024-02-08 12:12:03 -08:00
Paul Gauthier
84ae51c005 Add functionality to write scraped content to a temporary file and optimize HTML parsing. 2024-02-08 12:11:59 -08:00
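The temporary-file half of this commit, sketched by assumption:

```python
import tempfile


def write_temp_copy(content, suffix=".html"):
    # Keep the scraped page on disk so it survives the run for debugging;
    # name and signature are assumed, not taken from the commit.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=suffix, delete=False, encoding="utf-8"
    ) as f:
        f.write(content)
    return f.name
```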
Paul Gauthier
33a62c8dc6 aider: finish slimdown_html. remove all <img> tags and all per-element css-styles and all internal anchor elements
# Aider chat conversation:

USER: finish `slimdown_html`. remove all <img> tags and all per-element css-styles and all internal anchor elements

ASSISTANT: ```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
-def slimdown_html(page_source: str) -> str:
-    soup = BeautifulSoup(page_source, "html.parser")
-    # ...
+def slimdown_html(page_source: str) -> str:
+    soup = BeautifulSoup(page_source, "html.parser")
+    # Remove all <img> tags
+    for img in soup.find_all('img'):
+        img.decompose()
+    # Remove all per-element CSS styles
+    for tag in soup.find_all(True):
+        tag.attrs.pop('style', None)
+    # Remove all internal anchor elements
+    for anchor in soup.find_all('a', href=True):
+        if anchor['href'].startswith('#'):
+            anchor.decompose()
+    return str(soup)
```
2024-02-08 12:10:12 -08:00
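A quick usage sketch of the completed function from the diff above (assumes `BeautifulSoup` is imported at module level; output shown approximately):

```python
html = '<p style="color:red">Hi <a href="#top">top</a> <img src="x.png"></p>'
print(slimdown_html(html))
# roughly: <p>Hi  </p> -- inline style, internal anchor, and <img> all removed
```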
Paul Gauthier
1520bb976a Change html_to_text to html_to_markdown and add slimdown_html function. 2024-02-08 12:10:10 -08:00
Paul Gauthier
1509e21698 aider: Implemented html_to_markdown using pypandoc.
# Aider chat conversation:

USER: use pypandoc to implement `html_to_markdown`

ASSISTANT: To implement `html_to_markdown` using `pypandoc`, I will:

1. Add an import for `pypandoc`.
2. Implement the `html_to_markdown` function using `pypandoc.convert_text`.

Here are the diffs for those changes:

```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 import sys
+import pypandoc
 import httpx
 from bs4 import BeautifulSoup
 from playwright.sync_api import sync_playwright
@@ ... @@
 def html_to_markdown(page_source: str) -> str:
-    pass
+    return pypandoc.convert_text(page_source, 'md', format='html')
```

Please ensure that `pypandoc` is installed in your environment to use this implementation. If it's not installed, you can install it using `pip install pypandoc`.
2024-02-08 12:02:45 -08:00
Paul Gauthier
75e7f2bd4e Add html_to_markdown function to convert HTML to Markdown. 2024-02-08 12:02:42 -08:00
Paul Gauthier
681f26d010 Print playwright instructions after the content is displayed, so they are not lost 2024-02-08 12:01:18 -08:00
Paul Gauthier
5b78d929a6 Roughly working scraper 2024-02-08 11:44:55 -08:00
Paul Gauthier
9bf3a6e0c6 aider: Implemented HTTP request using httpx library in scrape_with_httpx method.
# Aider chat conversation:

USER: finish `scrape_with_httpx`

ASSISTANT: To finish the `scrape_with_httpx` method, I will implement the HTTP request using the `httpx` library. This will involve making a GET request to the provided URL and returning the content of the response.

Here is the diff for the changes:

```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
     def scrape_with_httpx(self, url):
-        pass
+        import httpx
+        headers = {
+            'User-Agent': aider_user_agent
+        }
+        try:
+            with httpx.Client(headers=headers) as client:
+                response = client.get(url)
+                response.raise_for_status()
+                return response.text
+        except httpx.HTTPError as http_err:
+            self.print_error(f'HTTP error occurred: {http_err}')
+        except Exception as err:
+            self.print_error(f'An error occurred: {err}')
+        return None
```

This implementation uses `httpx.Client` to send a GET request with a custom User-Agent header. It handles any HTTP errors by printing them out and returns `None` in case of an error. If the request is successful, it returns the text content of the response.
2024-02-08 11:29:46 -08:00
Paul Gauthier
e04187d1ad Refactor scraper to use Playwright for web scraping and handle user agent string. 2024-02-08 11:29:44 -08:00
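A sketch of the refactor's two concerns together, launching via Playwright plus appending an agent string; `aider_user_agent` and the exact flow are assumptions:

```python
from playwright.sync_api import sync_playwright

aider_user_agent = "Aider (+https://aider.chat)"  # assumed constant


def scrape_with_playwright(url):
    with sync_playwright() as p:
        try:
            browser = p.chromium.launch()
        except Exception as e:
            print(repr(e))
            return None
        # Read the browser's real agent string, then append our own.
        probe = browser.new_page()
        user_agent = probe.evaluate("navigator.userAgent")
        probe.close()
        page = browser.new_page(user_agent=user_agent + " " + aider_user_agent)
        page.goto(url)
        content = page.content()
        browser.close()
    return content
```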
Paul Gauthier
f10a9d2e7b aider: Retrieve the user agent from the browser context after creating a new page in the scrape_with_playwright function.
# Aider chat conversation:

USER: determine the user_agent

ASSISTANT: To determine the `user_agent` in the `scrape_with_playwright` function, we can use the `browser.new_context()` method which allows us to specify a `user_agent` string. We will set a common user agent string for a desktop browser.

Here is the diff for the changes:

```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 def scrape_with_playwright(url):
     with sync_playwright()
```
^C KeyboardInterrupt

USER: I don't want to SET it. I want to determine what it is

ASSISTANT: To determine the `user_agent` being used by the browser in the `scrape_with_playwright` function, we can retrieve it from the browser context after creating a new page. Here is the diff for the changes:

```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 def scrape_with_playwright(url):
     with sync_playwright() as p:
         try:
             browser = p.chromium.launch()
         except Exception as e:
             print(repr(e))
             return
+        page = browser.new_page()
+        user_agent = page.evaluate("navigator.userAgent")
+        print(f"User Agent: {user_agent}")
-        page = browser.new_page()
         page.goto(url)
         content = page.content()
         browser.close()

     return content
```
2024-02-08 11:02:54 -08:00
Paul Gauthier
65111591d7 Add scraper module for data extraction. 2024-02-08 11:02:52 -08:00