Commit graph

33 commits

Author SHA1 Message Date
Paul Gauthier
644ec6f964 make test for playwright more robust #791 2024-07-03 21:37:05 -03:00
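The diff itself isn't shown in this log. A sketch of one common hardening pattern for Playwright-backed tests, assuming the flakiness came from a missing browser toolchain (the test body below is illustrative, not the actual change from #791):

```python
import pytest

# Skip the whole module cleanly if playwright isn't installed.
playwright = pytest.importorskip("playwright.sync_api")


def test_scrape_with_playwright_smoke():
    with playwright.sync_playwright() as p:
        try:
            browser = p.chromium.launch()
        except Exception:
            # Browser binaries absent (`playwright install` not run):
            # skip rather than fail.
            pytest.skip("chromium not available for playwright")
        page = browser.new_page()
        page.goto("https://example.com")
        assert page.content()
        browser.close()
```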
Paul Gauthier
ed35af44b3 defer numpy, bs4 and jsonschema 2024-07-03 13:35:33 -03:00
Paul Gauthier
2dc6735ab4 defer import of httpx 2024-07-03 13:25:10 -03:00
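Both deferrals follow the same lazy-import pattern: heavy modules are imported inside the functions that need them, so `import aider` stays fast. A minimal sketch, with the placement assumed:

```python
def scrape_with_httpx(url):
    import httpx  # deferred: only paid for when a fetch actually happens

    with httpx.Client() as client:
        return client.get(url).text
```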
paul-gauthier
fab14fcd8b Merge pull request #671 from nicolasperez19/fix-url-playwright-typo 2024-06-17 13:56:16 -07:00
Paul Gauthier
5c87008bd0 Updated HISTORY 2024-06-16 12:16:19 -07:00
Nicolas Perez
abeb9f4d84 fix: enable_playwright url had a typo 2024-06-12 02:32:50 -04:00
Paul Gauthier
dece830cb8 wait up to 5sec 2024-06-11 06:37:41 -07:00
Paul Gauthier
4ebc35e038 aider: Set a maximum timeout of 5 seconds for page.goto in the scrape_with_playwright method.
# Aider chat conversation:

USER: set a max timeout of 5sec for page.goto

ASSISTANT: To set a maximum timeout of 5 seconds for `page.goto`, we need to modify the `scrape_with_playwright` method in `aider/scrape.py`.

Here is the *SEARCH/REPLACE* block:

aider/scrape.py
```python
<<<<<<< SEARCH
            page.goto(url, wait_until="networkidle")
=======
            page.goto(url, wait_until="networkidle", timeout=5000)
>>>>>>> REPLACE
```
2024-06-11 06:36:35 -07:00
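Playwright's `timeout` is given in milliseconds, and exceeding it raises `playwright.sync_api.TimeoutError`. A minimal sketch of guarding the capped call (the error handling is an assumption, not part of this commit):

```python
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError


def goto_with_cap(page, url):
    try:
        # timeout is in milliseconds, so 5000 caps the wait at 5 seconds
        page.goto(url, wait_until="networkidle", timeout=5000)
    except PlaywrightTimeoutError:
        # The page may still be usable even if the network never idled.
        pass
```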
Paul Gauthier
337891e739 aider: Modified scrape_with_playwright method to wait for network to be idle before extracting content.
# Aider chat conversation:

USER: with playwright enabled, i get: You need to enable JavaScript to run this app.

ASSISTANT: To address the issue of JavaScript not being enabled, we need to ensure that Playwright is properly executing JavaScript on the page. This can be done by waiting for the page to load completely and ensuring that JavaScript is executed.

We will modify the `scrape_with_playwright` method to wait for the network to be idle and ensure that the page is fully loaded before extracting the content.

Here are the changes:

aider/scrape.py
```python
<<<<<<< SEARCH
            page.goto(url)
=======
            page.goto(url, wait_until="networkidle")
>>>>>>> REPLACE
```

This change will ensure that Playwright waits for the network to be idle, which typically means that all JavaScript has been executed and the page is fully loaded.
2024-06-11 06:35:00 -07:00
Paul Gauthier
a17b81236c Add content dumping functionality to the Scraper class. 2024-06-11 06:34:58 -07:00
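The diff for this commit isn't included in the log. A hypothetical shape for such a helper, with the name and signature assumed:

```python
class Scraper:
    def dump_content(self, url, content, path="scraped.html"):
        # Hypothetical debugging helper: persist the raw page so a bad
        # scrape can be inspected after the run.
        with open(path, "w", encoding="utf-8") as f:
            f.write(f"<!-- {url} -->\n")
            f.write(content)
```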
Paul Gauthier
1f1da4b1a2 refac all urls into urls.py 2024-06-08 12:53:54 -07:00
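A sketch of the shape of that refactor, with the constant names assumed; callers then reference `urls.enable_playwright`, so a typo like the one fixed in #671 needs correcting in only one place:

```python
# aider/urls.py -- gather user-facing URLs in a single module
website = "https://aider.chat/"
enable_playwright = "https://aider.chat/docs/install/optional.html#enable-playwright"
```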
Paul Gauthier
0e5342fdb8 copy 2024-06-06 11:01:27 -07:00
Paul Gauthier
b8313c5343 added docstrings 2024-05-01 15:14:14 -07:00
Paul Gauthier
dcb6100ce9 Add web page 2024-04-27 15:28:08 -07:00
Paul Gauthier
0fa2505ac5 Delete pandoc installer 2024-02-10 08:48:22 -08:00
Paul Gauthier
6ddfc894e7 Updated HISTORY 2024-02-10 07:31:04 -08:00
Paul Gauthier
bdef4308fe Simpler calls to pypandoc 2024-02-08 16:14:17 -08:00
Paul Gauthier
efff174f9a Use download_pandoc, which works everywhere including arm64 2024-02-08 15:56:00 -08:00
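`pypandoc` ships a helper that fetches a prebuilt pandoc binary, which sidesteps platform installers entirely (including on arm64). A minimal sketch of the call this commit likely leans on:

```python
from pypandoc.pandoc_download import download_pandoc

# Downloads and unpacks a pandoc binary for the current platform.
download_pandoc()
```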
Paul Gauthier
2dee76378b keep hrefs 2024-02-08 15:19:00 -08:00
Paul Gauthier
05c74202ec copy 2024-02-08 14:20:49 -08:00
Paul Gauthier
cc36329691 output with pandoc, cleanup with bs and re 2024-02-08 14:14:42 -08:00
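A sketch of that three-stage pipeline, assuming this division of labor (the actual code isn't shown in this log):

```python
import re

import pypandoc
from bs4 import BeautifulSoup


def html_to_markdown(page_source):
    # 1. cleanup with BeautifulSoup: drop script/style noise
    soup = BeautifulSoup(page_source, "html.parser")
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # 2. output with pandoc: render the cleaned HTML as markdown
    md = pypandoc.convert_text(str(soup), "markdown", format="html")
    # 3. cleanup with re: collapse runs of blank lines
    return re.sub(r"\n\s*\n+", "\n\n", md)
```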
Paul Gauthier
be60b785a4 aider: Remove all data: URLs from slimdown_html.
# Aider chat conversation:

USER: in `slimdown_html` remove all data: urls

ASSISTANT: ```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 def slimdown_html(soup):
     # Remove all <img> tags
     for img in soup.find_all('img'):
         img.decompose()
+    # Remove all elements with data: URLs
+    for tag in soup.find_all(href=lambda x: x and x.startswith('data:')):
+        tag.decompose()
+    for tag in soup.find_all(src=lambda x: x and x.startswith('data:')):
+        tag.decompose()
     # Remove all per-element CSS styles
     for tag in soup.find_all(True):
         tag.attrs.pop('style', None)
     # Remove all internal anchor elements
     for anchor in soup.find_all('a', href=True):
         if anchor['href'].startswith('#'):
             anchor.decompose()
     return soup
```
2024-02-08 12:12:03 -08:00
Paul Gauthier
84ae51c005 Add functionality to write scraped content to a temporary file and optimize HTML parsing. 2024-02-08 12:11:59 -08:00
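The temporary-file half of this commit, sketched by assumption:

```python
import tempfile


def write_temp_copy(content, suffix=".html"):
    # Keep the scraped page on disk so it survives the run for debugging;
    # name and signature are assumed, not taken from the commit.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=suffix, delete=False, encoding="utf-8"
    ) as f:
        f.write(content)
    return f.name
```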
Paul Gauthier
33a62c8dc6 aider: finish slimdown_html. remove all <img> tags and all per-element css-styles and all internal anchor elements
# Aider chat conversation:

USER: finish `slimdown_html`. remove all <img> tags and all per-element css-styles and all internal anchor elements

ASSISTANT: ```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
-def slimdown_html(page_source: str) -> str:
-    soup = BeautifulSoup(page_source, "html.parser")
-    # ...
+def slimdown_html(page_source: str) -> str:
+    soup = BeautifulSoup(page_source, "html.parser")
+    # Remove all <img> tags
+    for img in soup.find_all('img'):
+        img.decompose()
+    # Remove all per-element CSS styles
+    for tag in soup.find_all(True):
+        tag.attrs.pop('style', None)
+    # Remove all internal anchor elements
+    for anchor in soup.find_all('a', href=True):
+        if anchor['href'].startswith('#'):
+            anchor.decompose()
+    return str(soup)
```
2024-02-08 12:10:12 -08:00
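A quick usage sketch of the completed function from the diff above (assumes `BeautifulSoup` is imported at module level; output shown approximately):

```python
html = '<p style="color:red">Hi <a href="#top">top</a> <img src="x.png"></p>'
print(slimdown_html(html))
# roughly: <p>Hi  </p> -- inline style, internal anchor, and <img> all removed
```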
Paul Gauthier
1520bb976a Change html_to_text to html_to_markdown and add slimdown_html function. 2024-02-08 12:10:10 -08:00
Paul Gauthier
1509e21698 aider: Implemented html_to_markdown using pypandoc.
# Aider chat conversation:

USER: use pypandoc to implement `html_to_markdown`

ASSISTANT: To implement `html_to_markdown` using `pypandoc`, I will:

1. Add an import for `pypandoc`.
2. Implement the `html_to_markdown` function using `pypandoc.convert_text`.

Here are the diffs for those changes:

```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 import sys
+import pypandoc
 import httpx
 from bs4 import BeautifulSoup
 from playwright.sync_api import sync_playwright
@@ ... @@
 def html_to_markdown(page_source: str) -> str:
-    pass
+    return pypandoc.convert_text(page_source, 'md', format='html')
```

Please ensure that `pypandoc` is installed in your environment to use this implementation. If it's not installed, you can install it using `pip install pypandoc`.
2024-02-08 12:02:45 -08:00
Paul Gauthier
75e7f2bd4e Add html_to_markdown function to convert HTML to Markdown. 2024-02-08 12:02:42 -08:00
Paul Gauthier
681f26d010 Print playwright instructions after the content is displayed, so they are not lost 2024-02-08 12:01:18 -08:00
Paul Gauthier
5b78d929a6 Roughly working scraper 2024-02-08 11:44:55 -08:00
Paul Gauthier
9bf3a6e0c6 aider: Implemented HTTP request using httpx library in scrape_with_httpx method.
# Aider chat conversation:

USER: finish `scrape_with_httpx`

ASSISTANT: To finish the `scrape_with_httpx` method, I will implement the HTTP request using the `httpx` library. This will involve making a GET request to the provided URL and returning the content of the response.

Here is the diff for the changes:

```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
     def scrape_with_httpx(self, url):
-        pass
+        import httpx
+        headers = {
+            'User-Agent': aider_user_agent
+        }
+        try:
+            with httpx.Client(headers=headers) as client:
+                response = client.get(url)
+                response.raise_for_status()
+                return response.text
+        except httpx.HTTPError as http_err:
+            self.print_error(f'HTTP error occurred: {http_err}')
+        except Exception as err:
+            self.print_error(f'An error occurred: {err}')
+        return None
```

This implementation uses `httpx.Client` to send a GET request with a custom User-Agent header. It handles any HTTP errors by printing them out and returns `None` in case of an error. If the request is successful, it returns the text content of the response.
2024-02-08 11:29:46 -08:00
Paul Gauthier
e04187d1ad Refactor scraper to use Playwright for web scraping and handle user agent string. 2024-02-08 11:29:44 -08:00
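A sketch of the refactor's two concerns together, launching via Playwright plus appending an agent string; `aider_user_agent` and the exact flow are assumptions:

```python
from playwright.sync_api import sync_playwright

aider_user_agent = "Aider (+https://aider.chat)"  # assumed constant


def scrape_with_playwright(url):
    with sync_playwright() as p:
        try:
            browser = p.chromium.launch()
        except Exception as e:
            print(repr(e))
            return None
        # Read the browser's real agent string, then append our own.
        probe = browser.new_page()
        user_agent = probe.evaluate("navigator.userAgent")
        probe.close()
        page = browser.new_page(user_agent=user_agent + " " + aider_user_agent)
        page.goto(url)
        content = page.content()
        browser.close()
    return content
```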
Paul Gauthier
f10a9d2e7b aider: Retrieve the user agent from the browser context after creating a new page in the scrape_with_playwright function.
# Aider chat conversation:

USER: determine the user_agent

ASSISTANT: To determine the `user_agent` in the `scrape_with_playwright` function, we can use the `browser.new_context()` method which allows us to specify a `user_agent` string. We will set a common user agent string for a desktop browser.

Here is the diff for the changes:

```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 def scrape_with_playwright(url):
     with sync_playwright()
```
^C KeyboardInterrupt

USER: I don't want to SET it. I want to determine what it is

ASSISTANT: To determine the `user_agent` being used by the browser in the `scrape_with_playwright` function, we can retrieve it from the browser context after creating a new page. Here is the diff for the changes:

```diff
--- aider/scrape.py
+++ aider/scrape.py
@@ ... @@
 def scrape_with_playwright(url):
     with sync_playwright() as p:
         try:
             browser = p.chromium.launch()
         except Exception as e:
             print(repr(e))
             return
+        page = browser.new_page()
+        user_agent = page.evaluate("navigator.userAgent")
+        print(f"User Agent: {user_agent}")
-        page = browser.new_page()
         page.goto(url)
         content = page.content()
         browser.close()

     return content
```
2024-02-08 11:02:54 -08:00
Paul Gauthier
65111591d7 Add scraper module for data extraction. 2024-02-08 11:02:52 -08:00