Commit graph

61 commits

Author SHA1 Message Date
Paul Gauthier (aider)
3a96a10d06 style: Format code with black 2024-09-21 18:46:24 -07:00
Paul Gauthier (aider)
3dfc63ce79 feat: Add support for following redirects in httpx-based scraping 2024-09-21 18:46:21 -07:00
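As a rough illustration of what redirect following in the httpx-based path involves, here is a minimal sketch; the function name, headers, and return shape are assumptions for illustration, not the actual aider code:

```python
import httpx

def scrape_with_httpx(url, verify_ssl=True):
    # follow_redirects=True is the behavior this commit adds; without it,
    # httpx hands back 3xx responses instead of following them.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; scraper)"}
    try:
        with httpx.Client(headers=headers, verify=verify_ssl, follow_redirects=True) as client:
            response = client.get(url)
            response.raise_for_status()
            return response.text, response.headers.get("content-type")
    except httpx.HTTPError as err:
        print(f"Unable to fetch {url}: {err}")
        return None, None
```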
Paul Gauthier
8172b7be4b move imports into method 2024-09-03 08:05:21 -07:00
Paul Gauthier (aider)
7b336c9eb4 style: Reorder imports in scrape.py 2024-09-03 08:04:12 -07:00
Paul Gauthier (aider)
58abad72cd refactor: update Playwright error handling 2024-09-03 08:04:08 -07:00
Paul Gauthier
ef4a9dc4ca feat: add error handling for pypandoc conversion in Scraper class 2024-09-03 08:01:45 -07:00
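A hedged sketch of the kind of guard this commit describes: wrap the pypandoc call and fall back to returning the raw HTML rather than crashing the scrape. The helper name is illustrative:

```python
import pypandoc

def html_to_markdown_safe(page_source):
    try:
        return pypandoc.convert_text(page_source, "markdown", format="html")
    except OSError:
        # pandoc binary not found; return the HTML source unconverted
        return page_source
    except RuntimeError as err:
        # pandoc was found but the conversion itself failed
        print(f"pandoc conversion failed: {err}")
        return page_source
```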
Paul Gauthier (aider)
bdded55e17 fix: handle potential None value in content-type header 2024-08-29 13:43:29 -07:00
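The fix here guards against servers that omit the Content-Type header entirely. A minimal illustration (helper name assumed):

```python
def get_mime_type(response):
    # headers.get() returns None when Content-Type is absent; defaulting
    # to "" keeps the .split() call from raising AttributeError.
    raw = response.headers.get("content-type") or ""
    return raw.split(";")[0].strip() or None
```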
Paul Gauthier (aider)
5cab55c74b style: format code with linter 2024-08-12 09:54:06 -07:00
Paul Gauthier (aider)
2f4dd04164 feat: Add HTML content detection to scrape method 2024-08-12 09:54:03 -07:00
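A sketch of content sniffing along the lines this commit describes, useful when a response carries no usable MIME type; the marker list is an assumption:

```python
def looks_like_html(text):
    # Look for markers that typically appear near the top of an HTML page.
    if not text:
        return False
    head = text[:1000].lower()
    return any(marker in head for marker in ("<!doctype html", "<html", "<head", "<body"))
```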
Paul Gauthier (aider)
ec63642666 style: Format code with linter 2024-08-12 09:51:04 -07:00
Paul Gauthier (aider)
55b7089766 fix: Handle UnboundLocalError in scrape_with_playwright 2024-08-12 09:51:01 -07:00
Paul Gauthier (aider)
2cc4ae6e88 style: Apply linter formatting changes 2024-08-10 06:00:41 -07:00
Paul Gauthier (aider)
dfe2359a86 feat: Implement MIME type detection in scrape methods 2024-08-10 06:00:38 -07:00
Paul Gauthier (aider)
c0982af02c feat: Modify scrape method to only convert HTML to markdown 2024-08-10 04:55:11 -07:00
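Taken together with the MIME type detection commit just above, the dispatch these two commits describe might look like the following sketch, reusing the hypothetical helpers sketched earlier; none of the names are the actual aider code:

```python
def scrape(url):
    content, mime_type = scrape_with_httpx(url)
    if not content:
        return None
    # Only HTML gets converted to markdown; JSON, plain text, and other
    # MIME types pass through untouched.
    if (mime_type and mime_type.startswith("text/html")) or looks_like_html(content):
        content = html_to_markdown_safe(content)
    return content
```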
Paul Gauthier
e1a9fd69e6 Implement playwright installation with dependencies and use system python executable. 2024-07-31 08:53:21 -03:00
Paul Gauthier (aider)
0f2aa62e80 Handle SSL certificate errors in the Playwright-based web scraper 2024-07-28 16:35:00 -03:00
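A minimal sketch of how a Playwright scrape can be told to tolerate certificate problems; the verify_ssl flag mirroring a Scraper-level option is an assumption based on the commits below:

```python
from playwright.sync_api import sync_playwright

def scrape_with_playwright(url, verify_ssl=True):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        # ignore_https_errors lets the page load proceed past certificate
        # errors when the user has opted out of SSL verification.
        context = browser.new_context(ignore_https_errors=not verify_ssl)
        page = context.new_page()
        page.goto(url, wait_until="networkidle", timeout=5000)
        content = page.content()
        browser.close()
    return content
```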
Paul Gauthier (aider)
5dc3bbb6fb Catch and report errors when scraping web pages with Playwright, without crashing the application. 2024-07-25 20:24:32 +02:00
Paul Gauthier
f7ce78bc87 show install text with output not error 2024-07-23 12:02:35 +02:00
Paul Gauthier
1a345a4036 Removed the ignore_https_errors option when launching the Playwright browser. 2024-07-23 11:39:00 +02:00
Paul Gauthier
d164c85426 Improved error handling in scrape.py by converting exception to string before printing. 2024-07-23 11:38:33 +02:00
Paul Gauthier (aider)
97e51e60fc Implemented SSL certificate verification option in the Scraper class. 2024-07-22 15:18:47 +02:00
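Plausibly, the option threads a single constructor flag through both backends; a sketch, not the actual class:

```python
class Scraper:
    # One flag, consumed as httpx's `verify` parameter on one path and as
    # Playwright's `ignore_https_errors` context option on the other.
    def __init__(self, verify_ssl=True):
        self.verify_ssl = verify_ssl
```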
Paul Gauthier
c076c134ac use html source if pandoc NOT available 2024-07-18 10:03:04 +01:00
Paul Gauthier
88214f963b return html if pandoc is not available 2024-07-18 10:01:50 +01:00
Paul Gauthier
903faa8fef Catch errors when installing pandoc 2024-07-18 09:58:47 +01:00
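A hedged sketch of the guard, assuming pypandoc's installer helper is what gets wrapped (as the later download_pandoc commit suggests):

```python
import pypandoc

def ensure_pandoc_available():
    # Never let a failed pandoc download (offline machine, unsupported
    # platform) take the whole application down.
    try:
        pypandoc.ensure_pandoc_installed(delete_installer=True)
        return True
    except Exception as err:
        print(f"Unable to install pandoc: {err}")
        return False
```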
Paul Gauthier
e9b3c13569 cleanup 2024-07-16 11:42:17 +01:00
Paul Gauthier
c5d93d7f0c removed get_playwright_instructions 2024-07-14 20:04:27 +01:00
Paul Gauthier
4fbe3d295a added [playwright] extra 2024-07-14 19:34:48 +01:00
Paul Gauthier
d9236d7684 wip 2024-07-13 07:48:28 +01:00
Paul Gauthier
644ec6f964 make test for playwright more robust #791 2024-07-03 21:37:05 -03:00
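One way to make the availability probe robust, per the spirit of this commit: treat any failure as "unavailable" rather than crashing, since playwright can be pip-installed while its browser binaries are still missing. A sketch:

```python
def playwright_available():
    try:
        from playwright.sync_api import sync_playwright
        with sync_playwright() as p:
            p.chromium.launch()
        return True
    except Exception:
        # covers both a missing package and missing browser binaries
        return False
```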
Paul Gauthier
ed35af44b3 defer numpy, bs4 and jsonschema 2024-07-03 13:35:33 -03:00
Paul Gauthier
2dc6735ab4 defer import of httpx 2024-07-03 13:25:10 -03:00
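Both "defer" commits follow the same pattern: import heavyweight modules inside the method that needs them, so the cost is only paid when a scrape actually happens rather than at startup. A minimal sketch:

```python
def scrape_with_httpx(url):
    # Deferred import: loading httpx happens on first use, keeping
    # application startup fast when the scraper is never invoked.
    import httpx

    with httpx.Client() as client:
        return client.get(url, follow_redirects=True).text
```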
paul-gauthier
fab14fcd8b Merge pull request #671 from nicolasperez19/fix-url-playwright-typo 2024-06-17 13:56:16 -07:00
Paul Gauthier
5c87008bd0 Updated HISTORY 2024-06-16 12:16:19 -07:00
Nicolas Perez
abeb9f4d84 fix: enable_playwright url had a typo 2024-06-12 02:32:50 -04:00
Paul Gauthier
dece830cb8 wait up to 5sec 2024-06-11 06:37:41 -07:00
Paul Gauthier
4ebc35e038 aider: Set a maximum timeout of 5 seconds for page.goto in the scrape_with_playwright method.
# Aider chat conversation:

USER: set a max timeout of 5sec for page.goto

ASSISTANT: To set a maximum timeout of 5 seconds for `page.goto`, we need to modify the `scrape_with_playwright` method in `aider/scrape.py`.

Here is the *SEARCH/REPLACE* block:

aider/scrape.py
```python
<<<<<<< SEARCH
            page.goto(url, wait_until="networkidle")
=======
            page.goto(url, wait_until="networkidle", timeout=5000)
>>>>>>> REPLACE
```
2024-06-11 06:36:35 -07:00
Paul Gauthier
337891e739 aider: Modified scrape_with_playwright method to wait for network to be idle before extracting content.
# Aider chat conversation:

USER: with playwright enabled, i get: You need to enable JavaScript to run this app.

ASSISTANT: To address the issue of JavaScript not being enabled, we need to ensure that Playwright is properly executing JavaScript on the page. This can be done by waiting for the page to load completely and ensuring that JavaScript is executed.

We will modify the `scrape_with_playwright` method to wait for the network to be idle and ensure that the page is fully loaded before extracting the content.

Here are the changes:

aider/scrape.py
```python
<<<<<<< SEARCH
            page.goto(url)
=======
            page.goto(url, wait_until="networkidle")
>>>>>>> REPLACE
```

This change will ensure that Playwright waits for the network to be idle, which typically means that all JavaScript has been executed and the page is fully loaded.
2024-06-11 06:35:00 -07:00
Paul Gauthier
a17b81236c Add content dumping functionality to the Scraper class. 2024-06-11 06:34:58 -07:00
Paul Gauthier
1f1da4b1a2 refac all urls into urls.py 2024-06-08 12:53:54 -07:00
Paul Gauthier
0e5342fdb8 copy 2024-06-06 11:01:27 -07:00
Paul Gauthier
b8313c5343 added docstrings 2024-05-01 15:14:14 -07:00
Paul Gauthier
dcb6100ce9 Add web page 2024-04-27 15:28:08 -07:00
Paul Gauthier
0fa2505ac5 Delete pandoc installer 2024-02-10 08:48:22 -08:00
Paul Gauthier
6ddfc894e7 Updated HISTORY 2024-02-10 07:31:04 -08:00
Paul Gauthier
bdef4308fe Simpler calls to pypandoc 2024-02-08 16:14:17 -08:00
Paul Gauthier
efff174f9a Use download_pandoc, which works everywhere including arm64 2024-02-08 15:56:00 -08:00
Paul Gauthier
2dee76378b keep hrefs 2024-02-08 15:19:00 -08:00
Paul Gauthier
05c74202ec copy 2024-02-08 14:20:49 -08:00
Paul Gauthier
cc36329691 output with pandoc, cleanup with bs and re 2024-02-08 14:14:42 -08:00
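The pipeline named here, sketched end to end: slimdown_html is the helper shown in the diff of the next commit, and the whitespace-cleanup regex is an assumption, not the exact code:

```python
import re

import pypandoc
from bs4 import BeautifulSoup

def html_to_markdown(page_source):
    # cleanup with bs: strip images, inline styles, and data: URLs
    soup = BeautifulSoup(page_source, "html.parser")
    soup = slimdown_html(soup)
    # output with pandoc: convert the cleaned HTML to markdown
    md = pypandoc.convert_text(str(soup), "markdown", format="html")
    # cleanup with re: collapse blank-line runs left by removed elements
    return re.sub(r"\n\s*\n", "\n\n", md)
```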
Paul Gauthier
be60b785a4 aider: Remove all data: URLs from slimdown_html.
# Aider chat conversation:

USER: in `slimdown_html` remove all data: urls

ASSISTANT: ```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 def slimdown_html(soup):
     # Remove all <img> tags
     for img in soup.find_all('img'):
         img.decompose()
+    # Remove all elements with data: URLs
+    for tag in soup.find_all(href=lambda x: x and x.startswith('data:')):
+        tag.decompose()
+    for tag in soup.find_all(src=lambda x: x and x.startswith('data:')):
+        tag.decompose()
     # Remove all per-element CSS styles
     for tag in soup.find_all(True):
         tag.attrs.pop('style', None)
     # Remove all internal anchor elements
     for anchor in soup.find_all('a', href=True):
         if anchor['href'].startswith('#'):
             anchor.decompose()
     return soup
```
2024-02-08 12:12:03 -08:00