Commit graph

61 commits

Author SHA1 Message Date
Paul Gauthier (aider)
3a96a10d06 style: Format code with black 2024-09-21 18:46:24 -07:00
Paul Gauthier (aider)
3dfc63ce79 feat: Add support for following redirects in httpx-based scraping 2024-09-21 18:46:21 -07:00
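As a rough illustration of what redirect following in the httpx-based path involves, here is a minimal sketch; the function name, headers, and return shape are assumptions for illustration, not the actual aider code:

```python
import httpx

def scrape_with_httpx(url, verify_ssl=True):
    # follow_redirects=True is the behavior this commit adds; without it,
    # httpx hands back 3xx responses instead of following them.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; scraper)"}
    try:
        with httpx.Client(headers=headers, verify=verify_ssl, follow_redirects=True) as client:
            response = client.get(url)
            response.raise_for_status()
            return response.text, response.headers.get("content-type")
    except httpx.HTTPError as err:
        print(f"Unable to fetch {url}: {err}")
        return None, None
```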
Paul Gauthier
8172b7be4b move imports into method 2024-09-03 08:05:21 -07:00
Paul Gauthier (aider)
7b336c9eb4 style: Reorder imports in scrape.py 2024-09-03 08:04:12 -07:00
Paul Gauthier (aider)
58abad72cd refactor: update Playwright error handling 2024-09-03 08:04:08 -07:00
Paul Gauthier
ef4a9dc4ca feat: add error handling for pypandoc conversion in Scraper class 2024-09-03 08:01:45 -07:00
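A hedged sketch of the kind of guard this commit describes: wrap the pypandoc call and fall back to returning the raw HTML rather than crashing the scrape. The helper name is illustrative:

```python
import pypandoc

def html_to_markdown_safe(page_source):
    try:
        return pypandoc.convert_text(page_source, "markdown", format="html")
    except OSError:
        # pandoc binary not found; return the HTML source unconverted
        return page_source
    except RuntimeError as err:
        # pandoc was found but the conversion itself failed
        print(f"pandoc conversion failed: {err}")
        return page_source
```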
Paul Gauthier (aider)
bdded55e17 fix: handle potential None value in content-type header 2024-08-29 13:43:29 -07:00
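The fix here guards against servers that omit the Content-Type header entirely. A minimal illustration (helper name assumed):

```python
def get_mime_type(response):
    # headers.get() returns None when Content-Type is absent; defaulting
    # to "" keeps the .split() call from raising AttributeError.
    raw = response.headers.get("content-type") or ""
    return raw.split(";")[0].strip() or None
```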
Paul Gauthier (aider)
5cab55c74b style: format code with linter 2024-08-12 09:54:06 -07:00
Paul Gauthier (aider)
2f4dd04164 feat: Add HTML content detection to scrape method 2024-08-12 09:54:03 -07:00
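A sketch of content sniffing along the lines this commit describes, useful when a response carries no usable MIME type; the marker list is an assumption:

```python
def looks_like_html(text):
    # Look for markers that typically appear near the top of an HTML page.
    if not text:
        return False
    head = text[:1000].lower()
    return any(marker in head for marker in ("<!doctype html", "<html", "<head", "<body"))
```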
Paul Gauthier (aider)
ec63642666 style: Format code with linter 2024-08-12 09:51:04 -07:00
Paul Gauthier (aider)
55b7089766 fix: Handle UnboundLocalError in scrape_with_playwright 2024-08-12 09:51:01 -07:00
Paul Gauthier (aider)
2cc4ae6e88 style: Apply linter formatting changes 2024-08-10 06:00:41 -07:00
Paul Gauthier (aider)
dfe2359a86 feat: Implement MIME type detection in scrape methods 2024-08-10 06:00:38 -07:00
Paul Gauthier (aider)
c0982af02c feat: Modify scrape method to only convert HTML to markdown 2024-08-10 04:55:11 -07:00
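Taken together with the MIME type detection commit just above, the dispatch these two commits describe might look like the following sketch, reusing the hypothetical helpers sketched earlier; none of the names are the actual aider code:

```python
def scrape(url):
    content, mime_type = scrape_with_httpx(url)
    if not content:
        return None
    # Only HTML gets converted to markdown; JSON, plain text, and other
    # MIME types pass through untouched.
    if (mime_type and mime_type.startswith("text/html")) or looks_like_html(content):
        content = html_to_markdown_safe(content)
    return content
```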
Paul Gauthier
e1a9fd69e6 Implement playwright installation with dependencies and use system python executable. 2024-07-31 08:53:21 -03:00
Paul Gauthier (aider)
0f2aa62e80 Handle SSL certificate errors in the Playwright-based web scraper 2024-07-28 16:35:00 -03:00
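A minimal sketch of how a Playwright scrape can be told to tolerate certificate problems; the verify_ssl flag mirroring a Scraper-level option is an assumption based on the commits below:

```python
from playwright.sync_api import sync_playwright

def scrape_with_playwright(url, verify_ssl=True):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        # ignore_https_errors lets the page load proceed past certificate
        # errors when the user has opted out of SSL verification.
        context = browser.new_context(ignore_https_errors=not verify_ssl)
        page = context.new_page()
        page.goto(url, wait_until="networkidle", timeout=5000)
        content = page.content()
        browser.close()
    return content
```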
Paul Gauthier (aider)
5dc3bbb6fb Catch and report errors when scraping web pages with Playwright, without crashing the application. 2024-07-25 20:24:32 +02:00
Paul Gauthier
f7ce78bc87 show install text with output not error 2024-07-23 12:02:35 +02:00
Paul Gauthier
1a345a4036 Removed the ignore_https_errors option when launching the Playwright browser. 2024-07-23 11:39:00 +02:00
Paul Gauthier
d164c85426 Improved error handling in scrape.py by converting exception to string before printing. 2024-07-23 11:38:33 +02:00
Paul Gauthier (aider)
97e51e60fc Implemented SSL certificate verification option in the Scraper class. 2024-07-22 15:18:47 +02:00
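Plausibly, the option threads a single constructor flag through both backends; a sketch, not the actual class:

```python
class Scraper:
    # One flag, consumed as httpx's `verify` parameter on one path and as
    # Playwright's `ignore_https_errors` context option on the other.
    def __init__(self, verify_ssl=True):
        self.verify_ssl = verify_ssl
```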
Paul Gauthier
c076c134ac use html source if pandoc NOT available 2024-07-18 10:03:04 +01:00
Paul Gauthier
88214f963b return html if pandoc is not available 2024-07-18 10:01:50 +01:00
Paul Gauthier
903faa8fef Catch errors when installing pandoc 2024-07-18 09:58:47 +01:00
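A hedged sketch of the guard, assuming pypandoc's installer helper is what gets wrapped (as the later download_pandoc commit suggests):

```python
import pypandoc

def ensure_pandoc_available():
    # Never let a failed pandoc download (offline machine, unsupported
    # platform) take the whole application down.
    try:
        pypandoc.ensure_pandoc_installed(delete_installer=True)
        return True
    except Exception as err:
        print(f"Unable to install pandoc: {err}")
        return False
```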
Paul Gauthier
e9b3c13569 cleanup 2024-07-16 11:42:17 +01:00
Paul Gauthier
c5d93d7f0c removed get_playwright_instructions 2024-07-14 20:04:27 +01:00
Paul Gauthier
4fbe3d295a added [playwright] extra 2024-07-14 19:34:48 +01:00
Paul Gauthier
d9236d7684 wip 2024-07-13 07:48:28 +01:00
Paul Gauthier
644ec6f964 make test for playwright more robust #791 2024-07-03 21:37:05 -03:00
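One way to make the availability probe robust, per the spirit of this commit: treat any failure as "unavailable" rather than crashing, since playwright can be pip-installed while its browser binaries are still missing. A sketch:

```python
def playwright_available():
    try:
        from playwright.sync_api import sync_playwright
        with sync_playwright() as p:
            p.chromium.launch()
        return True
    except Exception:
        # covers both a missing package and missing browser binaries
        return False
```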
Paul Gauthier
ed35af44b3 defer numpy, bs4 and jsonschema 2024-07-03 13:35:33 -03:00
Paul Gauthier
2dc6735ab4 defer import of httpx 2024-07-03 13:25:10 -03:00
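Both "defer" commits follow the same pattern: import heavyweight modules inside the method that needs them, so the cost is only paid when a scrape actually happens rather than at startup. A minimal sketch:

```python
def scrape_with_httpx(url):
    # Deferred import: loading httpx happens on first use, keeping
    # application startup fast when the scraper is never invoked.
    import httpx

    with httpx.Client() as client:
        return client.get(url, follow_redirects=True).text
```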
paul-gauthier
fab14fcd8b Merge pull request #671 from nicolasperez19/fix-url-playwright-typo 2024-06-17 13:56:16 -07:00
Paul Gauthier
5c87008bd0 Updated HISTORY 2024-06-16 12:16:19 -07:00
Nicolas Perez
abeb9f4d84 fix: enable_playwright url had a typo 2024-06-12 02:32:50 -04:00
Paul Gauthier
dece830cb8 wait up to 5sec 2024-06-11 06:37:41 -07:00
Paul Gauthier
4ebc35e038 aider: Set a maximum timeout of 5 seconds for page.goto in the scrape_with_playwright method.
# Aider chat conversation:

USER: set a max timeout of 5sec for page.goto

ASSISTANT: To set a maximum timeout of 5 seconds for `page.goto`, we need to modify the `scrape_with_playwright` method in `aider/scrape.py`.

Here is the *SEARCH/REPLACE* block:

aider/scrape.py
```python
<<<<<<< SEARCH
            page.goto(url, wait_until="networkidle")
=======
            page.goto(url, wait_until="networkidle", timeout=5000)
>>>>>>> REPLACE
```
2024-06-11 06:36:35 -07:00
Paul Gauthier
337891e739 aider: Modified scrape_with_playwright method to wait for network to be idle before extracting content.
# Aider chat conversation:

USER: with playwright enabled, i get: You need to enable JavaScript to run this app.

ASSISTANT: To address the issue of JavaScript not being enabled, we need to ensure that Playwright is properly executing JavaScript on the page. This can be done by waiting for the page to load completely and ensuring that JavaScript is executed.

We will modify the `scrape_with_playwright` method to wait for the network to be idle and ensure that the page is fully loaded before extracting the content.

Here are the changes:

aider/scrape.py
```python
<<<<<<< SEARCH
            page.goto(url)
=======
            page.goto(url, wait_until="networkidle")
>>>>>>> REPLACE
```

This change will ensure that Playwright waits for the network to be idle, which typically means that all JavaScript has been executed and the page is fully loaded.
2024-06-11 06:35:00 -07:00
Paul Gauthier
a17b81236c Add content dumping functionality to the Scraper class. 2024-06-11 06:34:58 -07:00
Paul Gauthier
1f1da4b1a2 refac all urls into urls.py 2024-06-08 12:53:54 -07:00
Paul Gauthier
0e5342fdb8 copy 2024-06-06 11:01:27 -07:00
Paul Gauthier
b8313c5343 added docstrings 2024-05-01 15:14:14 -07:00
Paul Gauthier
dcb6100ce9 Add web page 2024-04-27 15:28:08 -07:00
Paul Gauthier
0fa2505ac5 Delete pandoc installer 2024-02-10 08:48:22 -08:00
Paul Gauthier
6ddfc894e7 Updated HISTORY 2024-02-10 07:31:04 -08:00
Paul Gauthier
bdef4308fe Simpler calls to pypandoc 2024-02-08 16:14:17 -08:00
Paul Gauthier
efff174f9a Use download_pandoc, which works everywhere including arm64 2024-02-08 15:56:00 -08:00
Paul Gauthier
2dee76378b keep hrefs 2024-02-08 15:19:00 -08:00
Paul Gauthier
05c74202ec copy 2024-02-08 14:20:49 -08:00
Paul Gauthier
cc36329691 output with pandoc, cleanup with bs and re 2024-02-08 14:14:42 -08:00
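The pipeline named here, sketched end to end: slimdown_html is the helper shown in the diff of the next commit, and the whitespace-cleanup regex is an assumption, not the exact code:

```python
import re

import pypandoc
from bs4 import BeautifulSoup

def html_to_markdown(page_source):
    # cleanup with bs: strip images, inline styles, and data: URLs
    soup = BeautifulSoup(page_source, "html.parser")
    soup = slimdown_html(soup)
    # output with pandoc: convert the cleaned HTML to markdown
    md = pypandoc.convert_text(str(soup), "markdown", format="html")
    # cleanup with re: collapse blank-line runs left by removed elements
    return re.sub(r"\n\s*\n", "\n\n", md)
```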
Paul Gauthier
be60b785a4 aider: Remove all data: URLs from slimdown_html.
# Aider chat conversation:

USER: in `slimdown_html` remove all data: urls

ASSISTANT: ```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 def slimdown_html(soup):
     # Remove all <img> tags
     for img in soup.find_all('img'):
         img.decompose()
+    # Remove all elements with data: URLs
+    for tag in soup.find_all(href=lambda x: x and x.startswith('data:')):
+        tag.decompose()
+    for tag in soup.find_all(src=lambda x: x and x.startswith('data:')):
+        tag.decompose()
     # Remove all per-element CSS styles
     for tag in soup.find_all(True):
         tag.attrs.pop('style', None)
     # Remove all internal anchor elements
     for anchor in soup.find_all('a', href=True):
         if anchor['href'].startswith('#'):
             anchor.decompose()
     return soup
```
2024-02-08 12:12:03 -08:00