Paul Gauthier
d9e52e41ff
fix: Replace self.print_error with print for timeout message
2025-03-28 15:36:25 -10:00
Paul Gauthier (aider)
a038bc002a
feat: Include URL in page timeout warning message
2025-03-28 15:35:01 -10:00
Paul Gauthier (aider)
fa256eb1a7
feat: Change timeout error to warning and continue scraping
2025-03-28 15:34:18 -10:00
Paul Gauthier (aider)
3a96a10d06
style: Format code with black
2024-09-21 18:46:24 -07:00
Paul Gauthier (aider)
3dfc63ce79
feat: Add support for following redirects in httpx-based scraping
2024-09-21 18:46:21 -07:00
Paul Gauthier
8172b7be4b
move imports into method
2024-09-03 08:05:21 -07:00
Paul Gauthier (aider)
7b336c9eb4
style: Reorder imports in scrape.py
2024-09-03 08:04:12 -07:00
Paul Gauthier (aider)
58abad72cd
refactor: update Playwright error handling
2024-09-03 08:04:08 -07:00
Paul Gauthier
ef4a9dc4ca
feat: add error handling for pypandoc conversion in Scraper class
2024-09-03 08:01:45 -07:00
Paul Gauthier (aider)
bdded55e17
fix: handle potential None value in content-type header
2024-08-29 13:43:29 -07:00
Paul Gauthier (aider)
5cab55c74b
style: format code with linter
2024-08-12 09:54:06 -07:00
Paul Gauthier (aider)
2f4dd04164
feat: Add HTML content detection to scrape method
2024-08-12 09:54:03 -07:00
Paul Gauthier (aider)
ec63642666
style: Format code with linter
2024-08-12 09:51:04 -07:00
Paul Gauthier (aider)
55b7089766
fix: Handle UnboundLocalError in scrape_with_playwright
2024-08-12 09:51:01 -07:00
Paul Gauthier (aider)
2cc4ae6e88
style: Apply linter formatting changes
2024-08-10 06:00:41 -07:00
Paul Gauthier (aider)
dfe2359a86
feat: Implement MIME type detection in scrape methods
2024-08-10 06:00:38 -07:00
Paul Gauthier (aider)
c0982af02c
feat: Modify scrape method to only convert HTML to markdown
2024-08-10 04:55:11 -07:00
Paul Gauthier
e1a9fd69e6
Implement playwright installation with dependencies and use system python executable.
2024-07-31 08:53:21 -03:00
Paul Gauthier (aider)
0f2aa62e80
Handle SSL certificate errors in the Playwright-based web scraper
2024-07-28 16:35:00 -03:00
Paul Gauthier (aider)
5dc3bbb6fb
Catch and report errors when scraping web pages with Playwright, without crashing the application.
2024-07-25 20:24:32 +02:00
Paul Gauthier
f7ce78bc87
show install text with output not error
2024-07-23 12:02:35 +02:00
Paul Gauthier
1a345a4036
Removed the ignore_https_errors option when launching the Playwright browser.
2024-07-23 11:39:00 +02:00
Paul Gauthier
d164c85426
Improved error handling in scrape.py by converting exception to string before printing.
2024-07-23 11:38:33 +02:00
Paul Gauthier (aider)
97e51e60fc
Implemented SSL certificate verification option in the Scraper class.
2024-07-22 15:18:47 +02:00
Paul Gauthier
c076c134ac
use html source if pandoc NOT available
2024-07-18 10:03:04 +01:00
Paul Gauthier
88214f963b
return html if pandoc is not available
2024-07-18 10:01:50 +01:00
Paul Gauthier
903faa8fef
Catch errors when installing pandoc
2024-07-18 09:58:47 +01:00
Paul Gauthier
e9b3c13569
cleanup
2024-07-16 11:42:17 +01:00
Paul Gauthier
c5d93d7f0c
removed get_playwright_instructions
2024-07-14 20:04:27 +01:00
Paul Gauthier
4fbe3d295a
added [playwright] extra
2024-07-14 19:34:48 +01:00
Paul Gauthier
d9236d7684
wip
2024-07-13 07:48:28 +01:00
Paul Gauthier
644ec6f964
make test for playwright more robust #791
2024-07-03 21:37:05 -03:00
Paul Gauthier
ed35af44b3
defer numpy, bs4 and jsonschema
2024-07-03 13:35:33 -03:00
Paul Gauthier
2dc6735ab4
defer import of httpx
2024-07-03 13:25:10 -03:00
paul-gauthier
fab14fcd8b
Merge pull request #671 from nicolasperez19/fix-url-playwright-typo
2024-06-17 13:56:16 -07:00
Paul Gauthier
5c87008bd0
Updated HISTORY
2024-06-16 12:16:19 -07:00
Nicolas Perez
abeb9f4d84
fix: enable_playwright url had a typo
2024-06-12 02:32:50 -04:00
Paul Gauthier
dece830cb8
wait up to 5sec
2024-06-11 06:37:41 -07:00
Paul Gauthier
4ebc35e038
aider: Set a maximum timeout of 5 seconds for page.goto in the scrape_with_playwright method.
...
# Aider chat conversation:
USER: set a max timeout of 5sec for page.goto
ASSISTANT: To set a maximum timeout of 5 seconds for `page.goto`, we need to modify the `scrape_with_playwright` method in `aider/scrape.py`.
Here is the *SEARCH/REPLACE* block:
aider/scrape.py
```python
<<<<<<< SEARCH
page.goto(url, wait_until="networkidle")
=======
page.goto(url, wait_until="networkidle", timeout=5000)
>>>>>>> REPLACE
```
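For context, the call in the REPLACE block could be wrapped roughly as follows. This is only a sketch, not the code in `aider/scrape.py`: the helper names `goto_with_timeout` and `fetch_html` are hypothetical, and catching a broad `Exception` is an assumption made so the helper works with any page-like object. The only Playwright behaviour relied on is that `page.goto` accepts `wait_until` and `timeout` (in milliseconds) and raises when the timeout elapses.

```python
def goto_with_timeout(page, url, timeout_ms=5000):
    """Navigate to url, waiting for network idle for at most timeout_ms.

    Hypothetical helper: returns True if navigation finished, False if
    page.goto raised (for example on a timeout). The page stays usable,
    so the caller can still read whatever content was loaded.
    """
    try:
        page.goto(url, wait_until="networkidle", timeout=timeout_ms)
        return True
    except Exception as err:  # assumption: treat any navigation error as non-fatal
        print(f"Warning: page.goto did not finish for {url}: {err}")
        return False


def fetch_html(url):
    """Hypothetical end-to-end use; requires Playwright and a Chromium install."""
    # Imported lazily so this module still loads when Playwright is missing.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            goto_with_timeout(page, url)
            return page.content()
        finally:
            browser.close()
```

Calling `fetch_html("https://example.com")` would return the rendered HTML even when the page never reaches network idle within 5 seconds, at the cost of possibly incomplete content.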
2024-06-11 06:36:35 -07:00
Paul Gauthier
337891e739
aider: Modified scrape_with_playwright method to wait for network to be idle before extracting content.
...
# Aider chat conversation:
USER: with playwright enabled, i get: You need to enable JavaScript to run this app.
ASSISTANT: To address the issue of JavaScript not being enabled, we need to ensure that Playwright is properly executing JavaScript on the page. This can be done by waiting for the page to load completely and ensuring that JavaScript is executed.
We will modify the `scrape_with_playwright` method to wait for the network to be idle and ensure that the page is fully loaded before extracting the content.
Here are the changes:
aider/scrape.py
```python
<<<<<<< SEARCH
page.goto(url)
=======
page.goto(url, wait_until="networkidle")
>>>>>>> REPLACE
```
This change will ensure that Playwright waits for the network to be idle, which typically means that all JavaScript has been executed and the page is fully loaded.
2024-06-11 06:35:00 -07:00
Paul Gauthier
a17b81236c
Add content dumping functionality to the Scraper class.
2024-06-11 06:34:58 -07:00
Paul Gauthier
1f1da4b1a2
refac all urls into urls.py
2024-06-08 12:53:54 -07:00
Paul Gauthier
0e5342fdb8
copy
2024-06-06 11:01:27 -07:00
Paul Gauthier
b8313c5343
added docstrings
2024-05-01 15:14:14 -07:00
Paul Gauthier
dcb6100ce9
Add web page
2024-04-27 15:28:08 -07:00
Paul Gauthier
0fa2505ac5
Delete pandoc installer
2024-02-10 08:48:22 -08:00
Paul Gauthier
6ddfc894e7
Updated HISTORY
2024-02-10 07:31:04 -08:00
Paul Gauthier
bdef4308fe
Simpler calls to pypandoc
2024-02-08 16:14:17 -08:00
Paul Gauthier
efff174f9a
Use download_pandoc, which works everywhere including arm64
2024-02-08 15:56:00 -08:00
Paul Gauthier
2dee76378b
keep hrefs
2024-02-08 15:19:00 -08:00