diff --git a/README.md b/README.md index d40bc59..1bb23d8 100644 --- a/README.md +++ b/README.md @@ -41,7 +41,7 @@ The documentation will be available at `http://localhost:3000`. ├── introduction.mdx # Main introduction page ├── services/ # Core services documentation │ ├── smartscraper.mdx # SmartScraper service -│ ├── localscraper.mdx # LocalScraper service +│ ├── searchscraper.mdx # SearchScraper service │ ├── markdownify.mdx # Markdownify service │ └── extensions/ # Browser extensions │ └── firefox.mdx # Firefox extension diff --git a/api-reference/endpoint/localscraper/get-status.mdx b/api-reference/endpoint/localscraper/get-status.mdx deleted file mode 100644 index 83d2e61..0000000 --- a/api-reference/endpoint/localscraper/get-status.mdx +++ /dev/null @@ -1,13 +0,0 @@ ---- -title: 'Get LocalScraper Status' -openapi: 'GET /v1/localscraper/{request_id}' -description: 'Check the status and retrieve results of a LocalScraper request.' ---- - -This endpoint allows you to check the status of a LocalScraper request and retrieve its results once completed. - -### Status Values -- `queued`: Request is waiting to be processed -- `processing`: Request is being processed -- `completed`: Request has finished successfully -- `failed`: Request failed to process diff --git a/api-reference/endpoint/localscraper/start.mdx b/api-reference/endpoint/localscraper/start.mdx deleted file mode 100644 index 11be89b..0000000 --- a/api-reference/endpoint/localscraper/start.mdx +++ /dev/null @@ -1,48 +0,0 @@ ---- -title: 'Start LocalScraper' -openapi: 'POST /v1/localscraper' -description: 'Extract content from HTML content using AI by providing a natural language prompt and the HTML content.' ---- - -LocalScraper works similarly to SmartScraper but accepts HTML content directly instead of a URL. This is useful when you already have the HTML content or need to scrape content that requires authentication. - -### Key Features -- Process HTML content directly -- No URL required -- Supports large HTML files (up to 2MB) -- Same AI-powered extraction capabilities as SmartScraper - -### Example Response - -```json -{ - "request_id": "", - "status": "completed", - "user_prompt": "Extract all product prices and names", - "result": { - "products": [ - { - "name": "iPhone 15 Pro", - "price": "$999.99", - "availability": "In Stock" - }, - { - "name": "MacBook Air M2", - "price": "$1,299.00", - "availability": "Pre-order" - }, - { - "name": "AirPods Pro", - "price": "$249.99", - "availability": "In Stock" - }, - { - "name": "iPad Air", - "price": "$599.00", - "availability": "Out of Stock" - } - ] - }, - "error": "" -} -``` \ No newline at end of file diff --git a/api-reference/endpoint/searchscraper/get-status.mdx b/api-reference/endpoint/searchscraper/get-status.mdx new file mode 100644 index 0000000..f9dbc11 --- /dev/null +++ b/api-reference/endpoint/searchscraper/get-status.mdx @@ -0,0 +1,93 @@ +--- +title: 'Get SearchScraper Status' +api: 'GET /v1/searchscraper/{request_id}' +description: 'Get the status and results of a previous search request' +--- + +## Path Parameters + + + The unique identifier of the search request to retrieve. + + Example: "123e4567-e89b-12d3-a456-426614174000" + + +## Response + + + The unique identifier of the search request. + + + + Status of the request. One of: "queued", "processing", "completed", "failed" + + + + The original search query that was submitted. + + + + The search results. If an output_schema was provided in the original request, this will be structured according to that schema. 
+ + + + List of URLs that were used as references for the answer. + + + + Error message if the request failed. Empty string if successful. + + +## Example Request + +```bash +curl 'https://api.scrapegraphai.com/v1/searchscraper/123e4567-e89b-12d3-a456-426614174000' \ +-H 'SGAI-APIKEY: YOUR_API_KEY' +``` + +## Example Response + +```json +{ + "request_id": "123e4567-e89b-12d3-a456-426614174000", + "status": "completed", + "user_prompt": "What is the latest version of Python and what are its main features?", + "result": { + "version": "3.12", + "release_date": "October 2, 2023", + "major_features": [ + "Improved error messages", + "Per-interpreter GIL", + "Support for the Linux perf profiler", + "Faster startup time" + ] + }, + "reference_urls": [ + "https://www.python.org/downloads/", + "https://docs.python.org/3.12/whatsnew/3.12.html" + ], + "error": "" +} +``` + +## Error Responses + + + Returned when the request_id is not a valid UUID. + + ```json + { + "error": "request_id must be a valid UUID" + } + ``` + + + + Returned when the request_id is not found. + + ```json + { + "error": "Request not found" + } + ``` + \ No newline at end of file diff --git a/api-reference/endpoint/searchscraper/start.mdx b/api-reference/endpoint/searchscraper/start.mdx new file mode 100644 index 0000000..c0dae2e --- /dev/null +++ b/api-reference/endpoint/searchscraper/start.mdx @@ -0,0 +1,111 @@ +--- +title: 'Start SearchScraper' +api: 'POST /v1/searchscraper' +description: 'Start a new AI-powered web search request' +--- + +## Request Body + + + The search query or question you want to ask. This should be a clear and specific prompt that will guide the AI in finding and extracting relevant information. + + Example: "What is the latest version of Python and what are its main features?" + + + + Optional headers to customize the search behavior. This can include user agent, cookies, or other HTTP headers. + + Example: + ```json + { + "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", + "Cookie": "cookie1=value1; cookie2=value2" + } + ``` + + + + Optional schema to structure the output. If provided, the AI will attempt to format the results according to this schema. + + Example: + ```json + { + "properties": { + "version": {"type": "string"}, + "release_date": {"type": "string"}, + "major_features": {"type": "array", "items": {"type": "string"}} + }, + "required": ["version", "release_date", "major_features"] + } + ``` + + +## Response + + + Unique identifier for the search request. Use this ID to check the status and retrieve results. + + + + Status of the request. One of: "queued", "processing", "completed", "failed" + + + + The original search query that was submitted. + + + + The search results. If an output_schema was provided, this will be structured according to that schema. + + + + List of URLs that were used as references for the answer. + + + + Error message if the request failed. Empty string if successful. 
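+
+Because search requests run asynchronously, a typical client submits the request and then polls the status endpoint with the returned `request_id` until it reports `completed` or `failed`. A minimal polling sketch (assuming Python with the `requests` package; `YOUR_API_KEY` is a placeholder):
+
+```python
+import time
+import requests
+
+BASE_URL = "https://api.scrapegraphai.com/v1/searchscraper"
+HEADERS = {"SGAI-APIKEY": "YOUR_API_KEY"}
+
+def poll_search(request_id: str, delay: float = 2.0, max_attempts: int = 30) -> dict:
+    """Poll the status endpoint until the request finishes or times out."""
+    for _ in range(max_attempts):
+        data = requests.get(f"{BASE_URL}/{request_id}", headers=HEADERS).json()
+        if data["status"] in ("completed", "failed"):
+            return data
+        time.sleep(delay)  # still "queued" or "processing"
+    raise TimeoutError("search request did not finish in time")
+```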
+ + +## Example Request + +```bash +curl -X POST 'https://api.scrapegraphai.com/v1/searchscraper' \ +-H 'SGAI-APIKEY: YOUR_API_KEY' \ +-H 'Content-Type: application/json' \ +-d '{ + "user_prompt": "What is the latest version of Python and what are its main features?", + "output_schema": { + "properties": { + "version": {"type": "string"}, + "release_date": {"type": "string"}, + "major_features": {"type": "array", "items": {"type": "string"}} + }, + "required": ["version", "release_date", "major_features"] + } +}' +``` + +## Example Response + +```json +{ + "request_id": "123e4567-e89b-12d3-a456-426614174000", + "status": "completed", + "user_prompt": "What is the latest version of Python and what are its main features?", + "result": { + "version": "3.12", + "release_date": "October 2, 2023", + "major_features": [ + "Improved error messages", + "Per-interpreter GIL", + "Support for the Linux perf profiler", + "Faster startup time" + ] + }, + "reference_urls": [ + "https://www.python.org/downloads/", + "https://docs.python.org/3.12/whatsnew/3.12.html" + ], + "error": "" +} +``` \ No newline at end of file diff --git a/api-reference/endpoint/user/get-credits.mdx b/api-reference/endpoint/user/get-credits.mdx index 0cff846..91befb0 100644 --- a/api-reference/endpoint/user/get-credits.mdx +++ b/api-reference/endpoint/user/get-credits.mdx @@ -5,9 +5,9 @@ description: 'Get the remaining credits and total credits used for your account. --- This endpoint allows you to check your account's credit balance and usage. Each API request consumes a different number of credits: -- Markdownify: 2 credits per webpage -- SmartScraper: 5 credits per webpage -- LocalScraper: 10 credits per webpage +- Markdownify: 2 credits per request +- SmartScraper: 10 credits per request +- SearchScraper: 30 credits per request The response shows: - `remaining_credits`: Number of credits available for use diff --git a/api-reference/endpoint/user/submit-feedback.mdx b/api-reference/endpoint/user/submit-feedback.mdx index 5d84e65..06ec50a 100644 --- a/api-reference/endpoint/user/submit-feedback.mdx +++ b/api-reference/endpoint/user/submit-feedback.mdx @@ -4,7 +4,7 @@ openapi: 'POST /v1/feedback' description: 'Submit feedback for a specific request with rating and optional comments.' --- -This endpoint allows you to submit feedback for any request you've made using our services (SmartScraper, LocalScraper, or Markdownify). Your feedback helps us improve our services. +This endpoint allows you to submit feedback for any request you've made using our services (SmartScraper, SearchScraper, or Markdownify). Your feedback helps us improve our services. ### Rating System - Rating scale: 0-5 stars diff --git a/api-reference/errors.mdx b/api-reference/errors.mdx index 1a05944..b1245d1 100644 --- a/api-reference/errors.mdx +++ b/api-reference/errors.mdx @@ -43,7 +43,7 @@ Indicates that the request was malformed or invalid. "error": "Invalid HTML content" } ``` - Applies to LocalScraper when the provided HTML is invalid. + Applies to SmartScraper when the provided HTML is invalid. diff --git a/api-reference/introduction.mdx b/api-reference/introduction.mdx index 85444d5..bee4eb5 100644 --- a/api-reference/introduction.mdx +++ b/api-reference/introduction.mdx @@ -5,7 +5,7 @@ description: 'Complete reference for the ScrapeGraphAI REST API' ## Overview -The ScrapeGraphAI API provides powerful endpoints for AI-powered web scraping and content extraction. 
Our RESTful API allows you to extract structured data from any website, process local HTML content, and convert web pages to clean markdown. +The ScrapeGraphAI API provides powerful endpoints for AI-powered web scraping and content extraction. Our RESTful API allows you to extract structured data from any website, perform AI-powered web searches, and convert web pages to clean markdown. ## Authentication @@ -31,8 +31,8 @@ https://api.scrapegraphai.com/v1 Extract structured data from any website using AI - - Process local HTML content with AI extraction + + Perform AI-powered web searches with structured results Convert web content to clean markdown diff --git a/api-reference/openapi.json b/api-reference/openapi.json index 329e3a1..5bbc8cf 100644 --- a/api-reference/openapi.json +++ b/api-reference/openapi.json @@ -11,18 +11,18 @@ } ], "paths": { - "/v1/smartcrawler": { + "/v1/smartscraper": { "post": { "tags": [ - "SmartCrawler" + "SmartScraper" ], - "summary": "Start Smartcrawler", - "operationId": "start_smartcrawler_v1_smartcrawler_post", + "summary": "Start Smartscraper", + "operationId": "start_smartscraper_v1_smartscraper_post", "requestBody": { "content": { "application/json": { "schema": { - "$ref": "#/components/schemas/CrawlRequest" + "$ref": "#/components/schemas/ScrapeRequest" } } }, @@ -34,7 +34,7 @@ "content": { "application/json": { "schema": { - "$ref": "#/components/schemas/InProgressCrawlSessionResponse" + "$ref": "#/components/schemas/CompletedSmartscraperResponse" } } } @@ -54,16 +54,33 @@ { "APIKeyHeader": [] } + ], + "x-codeSamples": [ + { + "lang": "curl", + "label": "cURL", + "source": "curl -X POST 'https://api.scrapegraphai.com/v1/smartscraper' \\\n -H 'SGAI-APIKEY: YOUR_API_KEY' \\\n -H 'Content-Type: application/json' \\\n -d '{\n \"user_prompt\": \"Extract info about the company\",\n \"website_url\": \"https://scrapegraphai.com/\"\n }'" + }, + { + "lang": "python", + "label": "Python", + "source": "import requests\n\nurl = 'https://api.scrapegraphai.com/v1/smartscraper'\nheaders = {\n 'SGAI-APIKEY': 'YOUR_API_KEY',\n 'Content-Type': 'application/json'\n}\npayload = {\n 'user_prompt': 'Extract info about the company',\n 'website_url': 'https://scrapegraphai.com/'\n}\n\nresponse = requests.post(url, json=payload, headers=headers)\ndata = response.json()" + }, + { + "lang": "javascript", + "label": "JavaScript", + "source": "const url = 'https://api.scrapegraphai.com/v1/smartscraper';\nconst headers = {\n 'SGAI-APIKEY': 'YOUR_API_KEY',\n 'Content-Type': 'application/json'\n};\nconst payload = {\n user_prompt: 'Extract info about the company',\n website_url: 'https://scrapegraphai.com/'\n};\n\nfetch(url, {\n method: 'POST',\n headers: headers,\n body: JSON.stringify(payload)\n})\n .then(response => response.json())\n .then(data => console.log(data));" + } ] } }, - "/v1/smartcrawler/{session_id}": { + "/v1/smartscraper/{request_id}": { "get": { "tags": [ - "SmartCrawler" + "SmartScraper" ], - "summary": "Get Smartcrawler Status", - "operationId": "get_smartcrawler_status_v1_smartcrawler__session_id__get", + "summary": "Get Smartscraper Status", + "operationId": "get_smartscraper_status_v1_smartscraper__request_id__get", "security": [ { "APIKeyHeader": [] @@ -71,12 +88,12 @@ ], "parameters": [ { - "name": "session_id", + "name": "request_id", "in": "path", "required": true, "schema": { "type": "string", - "title": "Session Id" + "title": "Request Id" } } ], @@ -86,7 +103,7 @@ "content": { "application/json": { "schema": { - + "$ref": 
"#/components/schemas/CompletedSmartscraperResponse" } } } @@ -101,75 +118,38 @@ } } } - } - } - }, - "/v1/smartcrawler/sessions/all": { - "get": { - "tags": [ - "SmartCrawler" - ], - "summary": "Get All Sessions", - "operationId": "get_all_sessions_v1_smartcrawler_sessions_all_get", - "security": [ + }, + "x-codeSamples": [ { - "APIKeyHeader": [] - } - ], - "parameters": [ + "lang": "curl", + "label": "cURL", + "source": "curl 'https://api.scrapegraphai.com/v1/smartscraper/YOUR_REQUEST_ID' \\\n -H 'SGAI-APIKEY: YOUR_API_KEY'" + }, { - "name": "status", - "in": "query", - "required": false, - "schema": { - "anyOf": [ - { - "type": "string" - }, - { - "type": "null" - } - ], - "title": "Status" - } - } - ], - "responses": { - "200": { - "description": "Successful Response", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/CrawlSessionsList" - } - } - } + "lang": "python", + "label": "Python", + "source": "import requests\n\nurl = 'https://api.scrapegraphai.com/v1/smartscraper/YOUR_REQUEST_ID'\nheaders = {\n 'SGAI-APIKEY': 'YOUR_API_KEY'\n}\n\nresponse = requests.get(url, headers=headers)\ndata = response.json()" }, - "422": { - "description": "Validation Error", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/HTTPValidationError" - } - } - } + { + "lang": "javascript", + "label": "JavaScript", + "source": "const url = 'https://api.scrapegraphai.com/v1/smartscraper/YOUR_REQUEST_ID';\nconst headers = {\n 'SGAI-APIKEY': 'YOUR_API_KEY'\n};\n\nfetch(url, { headers })\n .then(response => response.json())\n .then(data => console.log(data));" } - } + ] } }, - "/v1/smartscraper": { + "/v1/markdownify": { "post": { "tags": [ - "SmartScraper" + "Markdownify" ], - "summary": "Start Smartscraper", - "operationId": "start_smartscraper_v1_smartscraper_post", + "summary": "Start Markdownify", + "operationId": "start_markdownify_v1_markdownify_post", "requestBody": { "content": { "application/json": { "schema": { - "$ref": "#/components/schemas/ScrapeRequest" + "$ref": "#/components/schemas/MarkdownifyRequest" } } }, @@ -181,7 +161,7 @@ "content": { "application/json": { "schema": { - "$ref": "#/components/schemas/CompletedSmartscraperResponse" + "$ref": "#/components/schemas/CompletedMarkdownifyResponse" } } } @@ -201,16 +181,33 @@ { "APIKeyHeader": [] } + ], + "x-codeSamples": [ + { + "lang": "curl", + "label": "cURL", + "source": "curl -X POST 'https://api.scrapegraphai.com/v1/markdownify' \\\n -H 'SGAI-APIKEY: YOUR_API_KEY' \\\n -H 'Content-Type: application/json' \\\n -d '{\n \"website_url\": \"https://scrapegraphai.com/\"\n }'" + }, + { + "lang": "python", + "label": "Python", + "source": "import requests\n\nurl = 'https://api.scrapegraphai.com/v1/markdownify'\nheaders = {\n 'SGAI-APIKEY': 'YOUR_API_KEY',\n 'Content-Type': 'application/json'\n}\npayload = {\n 'website_url': 'https://scrapegraphai.com/'\n}\n\nresponse = requests.post(url, json=payload, headers=headers)\ndata = response.json()" + }, + { + "lang": "javascript", + "label": "JavaScript", + "source": "const url = 'https://api.scrapegraphai.com/v1/markdownify';\nconst headers = {\n 'SGAI-APIKEY': 'YOUR_API_KEY',\n 'Content-Type': 'application/json'\n};\nconst payload = {\n website_url: 'https://scrapegraphai.com/'\n};\n\nfetch(url, {\n method: 'POST',\n headers: headers,\n body: JSON.stringify(payload)\n})\n .then(response => response.json())\n .then(data => console.log(data));" + } ] } }, - "/v1/smartscraper/{request_id}": { + "/v1/markdownify/{request_id}": { "get": { 
"tags": [ - "SmartScraper" + "Markdownify" ], - "summary": "Get Smartscraper Status", - "operationId": "get_smartscraper_status_v1_smartscraper__request_id__get", + "summary": "Get Markdownify Status", + "operationId": "get_markdownify_status_v1_markdownify__request_id__get", "security": [ { "APIKeyHeader": [] @@ -233,7 +230,7 @@ "content": { "application/json": { "schema": { - "$ref": "#/components/schemas/CompletedSmartscraperResponse" + "$ref": "#/components/schemas/CompletedMarkdownifyResponse" } } } @@ -248,43 +245,40 @@ } } } - } + }, + "x-codeSamples": [ + { + "lang": "curl", + "label": "cURL", + "source": "curl 'https://api.scrapegraphai.com/v1/markdownify/YOUR_REQUEST_ID' \\\n -H 'SGAI-APIKEY: YOUR_API_KEY'" + }, + { + "lang": "python", + "label": "Python", + "source": "import requests\n\nurl = 'https://api.scrapegraphai.com/v1/markdownify/YOUR_REQUEST_ID'\nheaders = {\n 'SGAI-APIKEY': 'YOUR_API_KEY'\n}\n\nresponse = requests.get(url, headers=headers)\ndata = response.json()" + }, + { + "lang": "javascript", + "label": "JavaScript", + "source": "const url = 'https://api.scrapegraphai.com/v1/markdownify/YOUR_REQUEST_ID';\nconst headers = {\n 'SGAI-APIKEY': 'YOUR_API_KEY'\n};\n\nfetch(url, { headers })\n .then(response => response.json())\n .then(data => console.log(data));" + } + ] } }, - "/v1/markdownify": { - "post": { + "/v1/credits": { + "get": { "tags": [ - "Markdownify" + "User" ], - "summary": "Start Markdownify", - "operationId": "start_markdownify_v1_markdownify_post", - "requestBody": { - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/MarkdownifyRequest" - } - } - }, - "required": true - }, + "summary": "Get Credits", + "operationId": "get_credits_v1_credits_get", "responses": { "200": { "description": "Successful Response", "content": { "application/json": { "schema": { - "$ref": "#/components/schemas/CompletedMarkdownifyResponse" - } - } - } - }, - "422": { - "description": "Validation Error", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/HTTPValidationError" + "$ref": "#/components/schemas/CreditsResponse" } } } @@ -294,39 +288,50 @@ { "APIKeyHeader": [] } + ], + "x-codeSamples": [ + { + "lang": "curl", + "label": "cURL", + "source": "curl 'https://api.scrapegraphai.com/v1/credits' \\\n -H 'SGAI-APIKEY: YOUR_API_KEY'" + }, + { + "lang": "python", + "label": "Python", + "source": "import requests\n\nurl = 'https://api.scrapegraphai.com/v1/credits'\nheaders = {\n 'SGAI-APIKEY': 'YOUR_API_KEY'\n}\n\nresponse = requests.get(url, headers=headers)\ndata = response.json()" + }, + { + "lang": "javascript", + "label": "JavaScript", + "source": "const url = 'https://api.scrapegraphai.com/v1/credits';\nconst headers = {\n 'SGAI-APIKEY': 'YOUR_API_KEY'\n};\n\nfetch(url, { headers })\n .then(response => response.json())\n .then(data => console.log(data));" + } ] } }, - "/v1/markdownify/{request_id}": { - "get": { + "/v1/feedback": { + "post": { "tags": [ - "Markdownify" - ], - "summary": "Get Markdownify Status", - "operationId": "get_markdownify_status_v1_markdownify__request_id__get", - "security": [ - { - "APIKeyHeader": [] - } + "User" ], - "parameters": [ - { - "name": "request_id", - "in": "path", - "required": true, - "schema": { - "type": "string", - "title": "Request Id" + "summary": "Submit Feedback", + "operationId": "submit_feedback_v1_feedback_post", + "requestBody": { + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/FeedbackCreate" + } } - } - ], + }, + 
"required": true + }, "responses": { "200": { "description": "Successful Response", "content": { "application/json": { "schema": { - "$ref": "#/components/schemas/CompletedMarkdownifyResponse" + "$ref": "#/components/schemas/FeedbackResponse" } } } @@ -341,21 +346,43 @@ } } } - } + }, + "security": [ + { + "APIKeyHeader": [] + } + ], + "x-codeSamples": [ + { + "lang": "curl", + "label": "cURL", + "source": "curl -X POST 'https://api.scrapegraphai.com/v1/feedback' \\\n -H 'SGAI-APIKEY: YOUR_API_KEY' \\\n -H 'Content-Type: application/json' \\\n -d '{\n \"request_id\": \"123e4567-e89b-12d3-a456-426614174000\",\n \"rating\": 5,\n \"feedback_text\": \"Great service!\"\n }'" + }, + { + "lang": "python", + "label": "Python", + "source": "import requests\n\nurl = 'https://api.scrapegraphai.com/v1/feedback'\nheaders = {\n 'SGAI-APIKEY': 'YOUR_API_KEY',\n 'Content-Type': 'application/json'\n}\npayload = {\n 'request_id': '123e4567-e89b-12d3-a456-426614174000',\n 'rating': 5,\n 'feedback_text': 'Great service!'\n}\n\nresponse = requests.post(url, json=payload, headers=headers)\ndata = response.json()" + }, + { + "lang": "javascript", + "label": "JavaScript", + "source": "const url = 'https://api.scrapegraphai.com/v1/feedback';\nconst headers = {\n 'SGAI-APIKEY': 'YOUR_API_KEY',\n 'Content-Type': 'application/json'\n};\nconst payload = {\n request_id: '123e4567-e89b-12d3-a456-426614174000',\n rating: 5,\n feedback_text: 'Great service!'\n};\n\nfetch(url, {\n method: 'POST',\n headers: headers,\n body: JSON.stringify(payload)\n})\n .then(response => response.json())\n .then(data => console.log(data));" + } + ] } }, - "/v1/localscraper": { + "/v1/searchscraper": { "post": { "tags": [ - "LocalScraper" + "SearchScraper" ], - "summary": "Start Localscraper", - "operationId": "start_localscraper_v1_localscraper_post", + "summary": "Start Searchscraper", + "operationId": "start_searchscraper_v1_searchscraper_post", "requestBody": { "content": { "application/json": { "schema": { - "$ref": "#/components/schemas/LocalscraperRequest" + "$ref": "#/components/schemas/SearchScraperRequest" } } }, @@ -367,7 +394,7 @@ "content": { "application/json": { "schema": { - "$ref": "#/components/schemas/CompletedLocalscraperResponse" + "$ref": "#/components/schemas/CompletedSearchScraperResponse" } } } @@ -387,16 +414,33 @@ { "APIKeyHeader": [] } + ], + "x-codeSamples": [ + { + "lang": "curl", + "label": "cURL", + "source": "curl -X POST 'https://api.scrapegraphai.com/v1/searchscraper' \\\n -H 'SGAI-APIKEY: YOUR_API_KEY' \\\n -H 'Content-Type: application/json' \\\n -d '{\n \"user_prompt\": \"What is the latest version of Python?\",\n \"output_schema\": {\n \"answer\": \"string\",\n \"details\": {\n \"version\": \"string\",\n \"release_date\": \"string\",\n \"download_url\": \"string\"\n }\n }\n }'" + }, + { + "lang": "python", + "label": "Python", + "source": "import requests\nimport time\n\ndef search_and_wait(prompt, output_schema=None, max_retries=30, delay=2):\n \"\"\"Search for information and wait for results\n \n Args:\n prompt (str): Natural language query\n output_schema (dict, optional): Schema to structure the response\n max_retries (int): Maximum number of status checks\n delay (int): Seconds to wait between status checks\n \"\"\"\n url = 'https://api.scrapegraphai.com/v1/searchscraper'\n headers = {\n 'SGAI-APIKEY': 'YOUR_API_KEY',\n 'Content-Type': 'application/json'\n }\n \n # Prepare the request payload\n payload = {\n 'user_prompt': prompt,\n 'headers': {\n 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; 
Win64; x64) AppleWebKit/537.36'\n }\n }\n \n # Add output schema if provided\n if output_schema:\n payload['output_schema'] = output_schema\n \n # Start the search\n response = requests.post(url, json=payload, headers=headers)\n \n if response.status_code != 200:\n raise Exception(f'Failed to start search: {response.text}')\n \n data = response.json()\n request_id = data['request_id']\n print(f'Search started with request ID: {request_id}')\n \n # Poll for results\n for attempt in range(max_retries):\n status_url = f'{url}/{request_id}'\n status_response = requests.get(status_url, headers=headers)\n status_data = status_response.json()\n \n if status_data['status'] == 'completed':\n print('\\nSearch completed successfully!')\n print('\\nResults:')\n print('--------')\n print(f\"Answer: {status_data['result'].get('answer')}\")\n if 'details' in status_data['result']:\n print('\\nDetails:')\n for key, value in status_data['result']['details'].items():\n print(f'{key}: {value}')\n print('\\nReference URLs:')\n for url in status_data['reference_urls']:\n print(f'- {url}')\n return status_data\n \n elif status_data['status'] == 'failed':\n raise Exception(f'Search failed: {status_data.get(\"error\", \"Unknown error\")}')\n \n print(f'Status: {status_data[\"status\"]}... (attempt {attempt + 1}/{max_retries})', end='\\r')\n time.sleep(delay)\n \n raise TimeoutError(f'Search did not complete after {max_retries * delay} seconds')\n\n# Example usage\ntry:\n # Define the output schema for structured results\n schema = {\n 'answer': 'string',\n 'details': {\n 'version': 'string',\n 'release_date': 'string',\n 'download_url': 'string'\n }\n }\n \n # Run the search with the schema\n results = search_and_wait(\n prompt='What is the latest version of Python?',\n output_schema=schema\n )\n \n # Access specific fields from the structured response\n version = results['result']['details']['version']\n release_date = results['result']['details']['release_date']\n \n # Use the data in your application\n print(f'\\nPython {version} was released on {release_date}')\n \nexcept Exception as e:\n print(f'Error: {str(e)}')" + }, + { + "lang": "javascript", + "label": "JavaScript", + "source": "const url = 'https://api.scrapegraphai.com/v1/searchscraper';\nconst headers = {\n 'SGAI-APIKEY': 'YOUR_API_KEY',\n 'Content-Type': 'application/json'\n};\nconst payload = {\n user_prompt: 'What is the latest version of Python?',\n output_schema: {\n answer: 'string',\n details: {\n version: 'string',\n release_date: 'string',\n download_url: 'string'\n }\n }\n};\n\nfetch(url, {\n method: 'POST',\n headers: headers,\n body: JSON.stringify(payload)\n})\n .then(response => response.json())\n .then(data => console.log(data));" + } ] } }, - "/v1/localscraper/{request_id}": { + "/v1/searchscraper/{request_id}": { "get": { "tags": [ - "LocalScraper" + "SearchScraper" ], - "summary": "Get Localscraper Status", - "operationId": "get_localscraper_status_v1_localscraper__request_id__get", + "summary": "Get Searchscraper Status", + "operationId": "get_searchscraper_status_v1_searchscraper__request_id__get", "security": [ { "APIKeyHeader": [] @@ -419,7 +463,7 @@ "content": { "application/json": { "schema": { - "$ref": "#/components/schemas/CompletedLocalscraperResponse" + "$ref": "#/components/schemas/CompletedSearchScraperResponse" } } } @@ -434,77 +478,22 @@ } } } - } - } - }, - "/v1/credits": { - "get": { - "tags": [ - "User" - ], - "summary": "Get Credits", - "operationId": "get_credits_v1_credits_get", - "responses": { - "200": { - 
"description": "Successful Response", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/CreditsResponse" - } - } - } - } }, - "security": [ + "x-codeSamples": [ { - "APIKeyHeader": [] - } - ] - } - }, - "/v1/feedback": { - "post": { - "tags": [ - "User" - ], - "summary": "Submit Feedback", - "operationId": "submit_feedback_v1_feedback_post", - "requestBody": { - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/FeedbackCreate" - } - } + "lang": "curl", + "label": "cURL", + "source": "curl 'https://api.scrapegraphai.com/v1/searchscraper/YOUR_REQUEST_ID' \\\n -H 'SGAI-APIKEY: YOUR_API_KEY'" }, - "required": true - }, - "responses": { - "200": { - "description": "Successful Response", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/FeedbackResponse" - } - } - } + { + "lang": "python", + "label": "Python", + "source": "import requests\n\ndef get_search_status(request_id):\n url = f'https://api.scrapegraphai.com/v1/searchscraper/{request_id}'\n headers = {\n 'SGAI-APIKEY': 'YOUR_API_KEY'\n }\n\n response = requests.get(url, headers=headers)\n \n if response.status_code == 200:\n data = response.json()\n status = data['status']\n \n if status == 'completed':\n return {\n 'success': True,\n 'status': status,\n 'result': data['result'],\n 'reference_urls': data['reference_urls']\n }\n elif status == 'failed':\n return {\n 'success': False,\n 'status': status,\n 'error': data['error']\n }\n else:\n return {\n 'success': None,\n 'status': status\n }\n else:\n return {\n 'success': False,\n 'error': f'Request failed with status code: {response.status_code}'\n }\n\n# Example usage\nrequest_id = 'YOUR_REQUEST_ID'\nstatus = get_search_status(request_id)\n\nif status['success'] is True:\n print('Search completed successfully!')\n print('Results:', status['result'])\n print('Reference URLs:', status['reference_urls'])\nelif status['success'] is False:\n print('Search failed:', status.get('error', 'Unknown error'))\nelse:\n print(f'Search is {status[\"status\"]}...')" }, - "422": { - "description": "Validation Error", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/HTTPValidationError" - } - } - } - } - }, - "security": [ { - "APIKeyHeader": [] + "lang": "javascript", + "label": "JavaScript", + "source": "const url = 'https://api.scrapegraphai.com/v1/searchscraper/YOUR_REQUEST_ID';\nconst headers = {\n 'SGAI-APIKEY': 'YOUR_API_KEY'\n};\n\nfetch(url, { headers })\n .then(response => response.json())\n .then(data => console.log(data));" } ] } @@ -512,44 +501,6 @@ }, "components": { "schemas": { - "CompletedLocalscraperResponse": { - "properties": { - "request_id": { - "type": "string", - "title": "Request Id" - }, - "status": { - "$ref": "#/components/schemas/LocalscraperStatus" - }, - "user_prompt": { - "type": "string", - "title": "User Prompt" - }, - "result": { - "anyOf": [ - { - "type": "object" - }, - { - "type": "null" - } - ], - "title": "Result" - }, - "error": { - "type": "string", - "title": "Error", - "default": "" - } - }, - "type": "object", - "required": [ - "request_id", - "status", - "user_prompt" - ], - "title": "CompletedLocalscraperResponse" - }, "CompletedMarkdownifyResponse": { "properties": { "request_id": { @@ -631,103 +582,6 @@ ], "title": "CompletedSmartscraperResponse" }, - "CrawlRequest": { - "properties": { - "website_url": { - "type": "string", - "title": "Website Url", - "example": "https://scrapegraphai.com/" - }, - "input_queries": { - 
"items": { - "type": "string" - }, - "type": "array", - "maxItems": 50, - "title": "Input Queries", - "example": [ - "What does the company do?", - "What are the company's core products?", - "In which sectors does the company operate?" - ] - } - }, - "type": "object", - "required": [ - "website_url", - "input_queries" - ], - "title": "CrawlRequest" - }, - "CrawlSessionResponseBase": { - "properties": { - "session_id": { - "type": "string", - "title": "Session Id" - }, - "status": { - "$ref": "#/components/schemas/CrawlStatus" - }, - "website_url": { - "type": "string", - "title": "Website Url" - }, - "timestamp": { - "type": "string", - "format": "date-time", - "title": "Timestamp" - }, - "message": { - "type": "string", - "title": "Message" - } - }, - "type": "object", - "required": [ - "session_id", - "status", - "website_url", - "timestamp", - "message" - ], - "title": "CrawlSessionResponseBase" - }, - "CrawlSessionsList": { - "properties": { - "sessions": { - "items": { - "$ref": "#/components/schemas/CrawlSessionResponseBase" - }, - "type": "array", - "title": "Sessions" - }, - "total_sessions": { - "type": "integer", - "title": "Total Sessions" - }, - "message": { - "type": "string", - "title": "Message", - "default": "Sessions retrieved successfully" - } - }, - "type": "object", - "required": [ - "sessions", - "total_sessions" - ], - "title": "CrawlSessionsList" - }, - "CrawlStatus": { - "type": "string", - "enum": [ - "queued", - "processing", - "completed", - "failed" - ], - "title": "CrawlStatus" - }, "CreditsResponse": { "properties": { "remaining_credits": { @@ -823,87 +677,24 @@ "type": "object", "title": "HTTPValidationError" }, - "InProgressCrawlSessionResponse": { - "properties": { - "session_id": { - "type": "string", - "title": "Session Id" - }, - "status": { - "$ref": "#/components/schemas/CrawlStatus" - }, - "website_url": { - "type": "string", - "title": "Website Url" - }, - "timestamp": { - "type": "string", - "format": "date-time", - "title": "Timestamp" - }, - "message": { - "type": "string", - "title": "Message" - } - }, - "type": "object", - "required": [ - "session_id", - "status", - "website_url", - "timestamp", - "message" - ], - "title": "InProgressCrawlSessionResponse" - }, - "LocalscraperRequest": { - "properties": { - "user_prompt": { - "type": "string", - "title": "User Prompt", - "example": "Extract info about the company" - }, - "website_html": { - "type": "string", - "title": "Website Html", - "description": "HTML content, maximum size 2MB", - "example": "\u003Chtml\u003E\u003Cbody\u003E\u003Ch1\u003ETitle\u003C/h1\u003E\u003Cp\u003EContent\u003C/p\u003E\u003C/body\u003E\u003C/html\u003E" - }, - "output_schema": { - "anyOf": [ - { - "type": "object" - }, - { - "type": "null" - } - ], - "title": "Output Schema" - } - }, - "type": "object", - "required": [ - "user_prompt", - "website_html" - ], - "title": "LocalscraperRequest" - }, - "LocalscraperStatus": { - "type": "string", - "enum": [ - "queued", - "processing", - "completed", - "failed" - ], - "title": "LocalscraperStatus" - }, "MarkdownifyRequest": { "properties": { "website_url": { "type": "string", "title": "Website Url", "example": "https://scrapegraphai.com/" + }, + "headers": { + "type": "object", + "additionalProperties": { + "type": "string" + }, + "title": "Headers", + "description": "Optional headers to send with the request, including cookies and user agent", + "example": { + "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", + "Cookie": "cookie1=value1; 
cookie2=value2" + } } }, "type": "object", @@ -934,6 +725,24 @@ "title": "Website Url", "example": "https://scrapegraphai.com/" }, + "website_html": { + "type": "string", + "title": "Website Html", + "description": "HTML content, maximum size 2MB", + "example": "

\u003Chtml\u003E\u003Cbody\u003E\u003Ch1\u003ETitle\u003C/h1\u003E\u003Cp\u003EContent\u003C/p\u003E\u003C/body\u003E\u003C/html\u003E
" + }, + "headers": { + "type": "object", + "additionalProperties": { + "type": "string" + }, + "title": "Headers", + "description": "Optional headers to send with the request, including cookies and user agent", + "example": { + "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", + "Cookie": "cookie1=value1; cookie2=value2" + } + }, "output_schema": { "anyOf": [ { @@ -948,10 +757,10 @@ }, "type": "object", "required": [ - "user_prompt", - "website_url" + "user_prompt" ], - "title": "ScrapeRequest" + "title": "ScrapeRequest", + "description": "Either website_url or website_html must be provided" }, "SmartscraperStatus": { "type": "string", @@ -995,6 +804,127 @@ "type" ], "title": "ValidationError" + }, + "SearchScraperRequest": { + "properties": { + "user_prompt": { + "type": "string", + "title": "User Prompt", + "example": "What is the latest version of Python?", + "description": "Natural language query to search for information on the web" + }, + "headers": { + "type": "object", + "additionalProperties": { + "type": "string" + }, + "title": "Headers", + "description": "Optional headers to send with the request, including cookies and user agent", + "example": { + "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", + "Cookie": "cookie1=value1; cookie2=value2" + } + }, + "output_schema": { + "anyOf": [ + { + "type": "object" + }, + { + "type": "null" + } + ], + "title": "Output Schema", + "description": "Optional schema to structure the response data", + "example": { + "answer": "string", + "details": { + "version": "string", + "release_date": "string", + "download_url": "string" + } + } + } + }, + "type": "object", + "required": [ + "user_prompt" + ], + "title": "SearchScraperRequest", + "description": "Request to search and extract information from multiple web sources" + }, + "CompletedSearchScraperResponse": { + "properties": { + "request_id": { + "type": "string", + "title": "Request Id", + "example": "123e4567-e89b-12d3-a456-426614174000" + }, + "status": { + "$ref": "#/components/schemas/SearchScraperStatus", + "example": "completed" + }, + "user_prompt": { + "type": "string", + "title": "User Prompt", + "example": "What is the latest version of Python?" 
+ }, + "result": { + "anyOf": [ + { + "type": "object" + }, + { + "type": "null" + } + ], + "title": "Result", + "example": { + "answer": "The latest version of Python is 3.12.1", + "details": { + "version": "3.12.1", + "release_date": "December 7, 2023", + "download_url": "https://www.python.org/downloads/" + } + } + }, + "reference_urls": { + "type": "array", + "items": { + "type": "string" + }, + "title": "Reference URLs", + "description": "List of URLs used as references for the answer", + "example": [ + "https://www.python.org/downloads/", + "https://docs.python.org/release/3.12.1/whatsnew/3.12.html" + ] + }, + "error": { + "type": "string", + "title": "Error", + "default": "", + "example": "" + } + }, + "type": "object", + "required": [ + "request_id", + "status", + "user_prompt" + ], + "title": "CompletedSearchScraperResponse", + "description": "Response containing the search results and extracted information" + }, + "SearchScraperStatus": { + "type": "string", + "enum": [ + "queued", + "processing", + "completed", + "failed" + ], + "title": "SearchScraperStatus" } }, "securitySchemes": { diff --git a/dashboard/overview.mdx b/dashboard/overview.mdx index e45d193..df173d8 100644 --- a/dashboard/overview.mdx +++ b/dashboard/overview.mdx @@ -28,11 +28,12 @@ Track your API usage patterns with our detailed analytics view: The usage graph provides: -- **Service-specific metrics**: Track usage for SmartScraper, LocalScraper, and Markdownify separately +- **Service-specific metrics**: Track usage for SmartScraper, SearchScraper, and Markdownify separately - **Time-based analysis**: View usage patterns over different time periods - **Interactive tooltips**: Hover over data points to see detailed information - **Trend analysis**: Identify usage patterns and optimize your API consumption + ## Key Features - **Usage Statistics**: Monitor your API usage and remaining credits diff --git a/dashboard/playground.mdx b/dashboard/playground.mdx index 50d2072..234a30f 100644 --- a/dashboard/playground.mdx +++ b/dashboard/playground.mdx @@ -17,9 +17,10 @@ The playground allows you to test our APIs interactively without writing any cod Choose from our three main services in the center panel: - [**SmartScraper**](/services/smartscraper): AI-powered scraping for any website -- [**LocalScraper**](/services/localscraper): AI-powered scraping for local HTML content +- [**SearchScraper**](/services/searchscraper): Find and extract any data using AI starting from a prompt - [**Markdownify**](/services/markdownify): Convert web content to clean Markdown + ### Interactive Features - **Service Info**: The right sidebar provides detailed information about the selected service diff --git a/integrations/phidata.mdx b/integrations/agno.mdx similarity index 69% rename from integrations/phidata.mdx rename to integrations/agno.mdx index e12a97c..0cd6b8e 100644 --- a/integrations/phidata.mdx +++ b/integrations/agno.mdx @@ -1,18 +1,21 @@ --- -title: '🦐 Phidata' -description: 'Build AI Assistants with ScrapeGraph using Phidata' +title: '🦐 Agno' +description: 'Build AI Assistants with ScrapeGraphAI and Agno' --- + ## Overview -[Phidata](https://www.phidata.com) is a development framework for building production-ready AI Assistants. This integration allows you to easily add ScrapeGraph's web scraping capabilities to your Phidata-powered AI agents. +[Agno](https://www.agno.com) is a development framework for building production-ready AI Assistants. 
This integration allows you to easily add ScrapeGraph's web scraping capabilities to your Agno-powered AI agents. - Learn more about building AI Assistants with Phidata + Learn more about building AI Assistants with Agno ## Installation @@ -20,22 +23,25 @@ description: 'Build AI Assistants with ScrapeGraph using Phidata' Install the required packages: ```bash -pip install -U phidata + +pip install -U agno pip install scrapegraph-py ``` ## Usage + ### Basic Example Create an AI Assistant with ScrapeGraph tools: ```python -from phi.agent import Agent -from phi.tools.scrapegraph_tools import ScrapeGraphTools +from agno.agent import Agent +from agno.tools.scrapegraph_tools import ScrapeGraphTools # Initialize with smartscraper enabled + scrapegraph = ScrapeGraphTools(smartscraper=True) # Create an agent with the tools @@ -62,8 +68,9 @@ Use smartscraper to extract the following from https://www.wired.com/category/sc You can also use ScrapeGraph to convert web pages to markdown: ```python -from phi.agent import Agent -from phi.tools.scrapegraph_tools import ScrapeGraphTools +from agno.agent import Agent +from agno.tools.scrapegraph_tools import ScrapeGraphTools + # Initialize with only markdownify enabled scrapegraph_md = ScrapeGraphTools(smartscraper=False) @@ -104,16 +111,18 @@ Need help with the integration? - Join the Phidata community + + Join the Agno community Check out the source code diff --git a/integrations/langchain.mdx b/integrations/langchain.mdx index 5426909..aed504f 100644 --- a/integrations/langchain.mdx +++ b/integrations/langchain.mdx @@ -65,51 +65,39 @@ result = tool.invoke({ ``` -### LocalScraperTool +### SearchScraperTool Process HTML content directly with AI extraction: ```python -from langchain_scrapegraph.tools import LocalScraperTool +from langchain_scrapegraph.tools import SearchScraperTool -tool = LocalScraperTool() + +tool = SearchScraperTool() result = tool.invoke({ - "user_prompt": "Extract all contact information", - "website_html": "..." + "user_prompt": "Find the best restaurants in San Francisco", }) + ``` ```python from typing import Optional from pydantic import BaseModel, Field -from langchain_scrapegraph.tools import LocalScraperTool - -class CompanyInfo(BaseModel): - name: str = Field(description="The company name") - description: str = Field(description="The company description") - email: Optional[str] = Field(description="Contact email if available") - phone: Optional[str] = Field(description="Contact phone if available") - -tool = LocalScraperTool(llm_output_schema=CompanyInfo) - -html_content = """ - - -

-    <h1>TechCorp Solutions</h1>
-    <p>We are a leading AI technology company.</p>
-    <div class="contact">
-      <p>Email: contact@techcorp.com</p>
-      <p>Phone: (555) 123-4567</p>
-    </div>
- - -""" +from langchain_scrapegraph.tools import SearchScraperTool + +class RestaurantInfo(BaseModel): + name: str = Field(description="The restaurant name") + address: str = Field(description="The restaurant address") + rating: float = Field(description="The restaurant rating") + + +tool = SearchScraperTool(llm_output_schema=RestaurantInfo) result = tool.invoke({ - "website_html": html_content, - "user_prompt": "Extract the company information" + "user_prompt": "Find the best restaurants in San Francisco" }) + ```
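+
+Because these tools implement the standard LangChain tool interface, they can also be bound to a chat model for agentic use. A minimal sketch (assuming `langchain-openai` is installed and an OpenAI key is configured; the model name is illustrative):
+
+```python
+from langchain_openai import ChatOpenAI
+from langchain_scrapegraph.tools import SearchScraperTool
+
+llm = ChatOpenAI(model="gpt-4o-mini")
+tool = SearchScraperTool()
+
+# Bind the tool so the model can decide when to call it.
+llm_with_tools = llm.bind_tools([tool])
+ai_msg = llm_with_tools.invoke("Find the best restaurants in San Francisco")
+print(ai_msg.tool_calls)  # any tool invocations the model requested
+```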
diff --git a/introduction.mdx b/introduction.mdx index 01578f8..9876c8c 100644 --- a/introduction.mdx +++ b/introduction.mdx @@ -50,10 +50,11 @@ ScrapeGraphAI is a powerful suite of LLM-driven web scraping tools designed to e Learn how to manage your account, monitor jobs, and access your API keys
- Explore our core services: SmartScraper, LocalScraper, and Markdownify + Explore our core services: SmartScraper, SearchScraper, and Markdownify Implement with Python, JavaScript, or integrate with LangChain and LlamaIndex + Detailed API documentation for direct integration @@ -62,9 +63,12 @@ ScrapeGraphAI is a powerful suite of LLM-driven web scraping tools designed to e ## Core Services -- **SmartScraper**: AI-powered extraction for any website -- **LocalScraper**: AI-powered extraction for local HTML content -- **Markdownify**: Convert web content to clean Markdown format +- **SmartScraper**: AI-powered extraction for any webpage +- **SearchScraper**: Find and extract any data using AI starting from a prompt +- **Markdownify**: Convert web content to clean Markdown format + + + ## Implementation Options diff --git a/mint.json b/mint.json index b5c8b4b..ef778cf 100644 --- a/mint.json +++ b/mint.json @@ -83,11 +83,18 @@ "group": "Services", "pages": [ "services/smartscraper", - "services/localscraper", + "services/searchscraper", "services/markdownify", + { + "group": "Additional Parameters", + "pages": [ + "services/additional-parameters/headers" + ] + }, { "group": "Browser Extensions", "pages": [ + "services/extensions/firefox" ] } @@ -106,7 +113,7 @@ "integrations/langchain", "integrations/llamaindex", "integrations/crewai", - "integrations/phidata" + "integrations/agno" ] }, { @@ -141,12 +148,13 @@ ] }, { - "group": "LocalScraper", + "group": "SearchScraper", "pages": [ - "api-reference/endpoint/localscraper/start", - "api-reference/endpoint/localscraper/get-status" + "api-reference/endpoint/searchscraper/start", + "api-reference/endpoint/searchscraper/get-status" ] }, + { "group": "Markdownify", "pages": [ diff --git a/sdks/javascript.mdx b/sdks/javascript.mdx index 9d37f1d..933bd1a 100644 --- a/sdks/javascript.mdx +++ b/sdks/javascript.mdx @@ -154,30 +154,98 @@ response.result.offices.forEach(office => { ``` -### LocalScraper +### SearchScraper + +Search and extract information from multiple web sources using AI: -Process local HTML content with AI extraction: ```javascript -const html = ` - - -

-      <h1>Company Name</h1>
-      <p>We are a technology company focused on AI solutions.</p>
-      <div class="contact">
-        <p>Email: contact@example.com</p>
-      </div>
- - -`; - -const response = await localScraper( + +const response = await searchScraper( apiKey, - html, - 'Extract the company description' + 'Find the best restaurants in San Francisco', ); ``` + +Define a simple schema using Zod: + +```typescript +import { z } from 'zod'; + +const ArticleSchema = z.object({ + title: z.string().describe('The article title'), + author: z.string().describe('The author\'s name'), + publishDate: z.string().describe('Article publication date'), + content: z.string().describe('Main article content'), + category: z.string().describe('Article category') +}); + +const response = await searchScraper( + apiKey, + 'Find news about the latest trends in AI', + ArticleSchema +); + + +console.log(`Title: ${response.result.title}`); +console.log(`Author: ${response.result.author}`); +console.log(`Published: ${response.result.publishDate}`); +``` + + + +Define a complex schema for nested data structures: + +```typescript +import { z } from 'zod'; + +const EmployeeSchema = z.object({ + name: z.string().describe('Employee\'s full name'), + position: z.string().describe('Job title'), + department: z.string().describe('Department name'), + email: z.string().describe('Email address') +}); + +const OfficeSchema = z.object({ + location: z.string().describe('Office location/city'), + address: z.string().describe('Full address'), + phone: z.string().describe('Contact number') +}); + +const RestaurantSchema = z.object({ + name: z.string().describe('Restaurant name'), + address: z.string().describe('Restaurant address'), + rating: z.number().describe('Restaurant rating'), + website: z.string().url().describe('Restaurant website URL') + + +}); + +// Extract comprehensive company information +const response = await searchScraper( + apiKey, + 'Find the best restaurants in San Francisco', + RestaurantSchema +); + + + +// Access nested data +console.log(`Restaurant: ${response.result.name}`); +console.log('\nAddress:'); +response.result.address.forEach(address => { + console.log(`- ${address}`); +}); + + +console.log('\nRating:'); +console.log(`- ${response.result.rating}`); +``` + + + + ### Markdownify Convert any webpage into clean, formatted markdown: diff --git a/sdks/python.mdx b/sdks/python.mdx index 012dc15..37432ec 100644 --- a/sdks/python.mdx +++ b/sdks/python.mdx @@ -133,29 +133,96 @@ for office in response.offices: ``` -### LocalScraper +### SearchScraper -Process local HTML content with AI extraction: +Search and extract information from multiple web sources using AI: ```python -html_content = """ - - -

-    <h1>Company Name</h1>
-    <p>We are a technology company focused on AI solutions.</p>
-    <div class="contact">
-      <p>Email: contact@example.com</p>
-    </div>
- - -""" - -response = client.localscraper( - user_prompt="Extract the company description", - website_html=html_content +response = client.searchscraper( + user_prompt="What are the key features and pricing of ChatGPT Plus?" ) ``` + +Define a simple schema for structured search results: + +```python +from pydantic import BaseModel, Field +from typing import List + +class ProductInfo(BaseModel): + name: str = Field(description="Product name") + description: str = Field(description="Product description") + price: str = Field(description="Product price") + features: List[str] = Field(description="List of key features") + availability: str = Field(description="Availability information") + +response = client.searchscraper( + user_prompt="Find information about iPhone 15 Pro", + output_schema=ProductInfo +) + +print(f"Product: {response.name}") +print(f"Price: {response.price}") +print("\nFeatures:") +for feature in response.features: + print(f"- {feature}") +``` + + + +Define a complex schema for comprehensive market research: + +```python +from typing import List +from pydantic import BaseModel, Field + +class MarketPlayer(BaseModel): + name: str = Field(description="Company name") + market_share: str = Field(description="Market share percentage") + key_products: List[str] = Field(description="Main products in market") + strengths: List[str] = Field(description="Company's market strengths") + +class MarketTrend(BaseModel): + name: str = Field(description="Trend name") + description: str = Field(description="Trend description") + impact: str = Field(description="Expected market impact") + timeframe: str = Field(description="Trend timeframe") + +class MarketAnalysis(BaseModel): + market_size: str = Field(description="Total market size") + growth_rate: str = Field(description="Annual growth rate") + key_players: List[MarketPlayer] = Field(description="Major market players") + trends: List[MarketTrend] = Field(description="Market trends") + challenges: List[str] = Field(description="Industry challenges") + opportunities: List[str] = Field(description="Market opportunities") + +# Perform comprehensive market research +response = client.searchscraper( + user_prompt="Analyze the current AI chip market landscape", + output_schema=MarketAnalysis +) + +# Access structured market data +print(f"Market Size: {response.market_size}") +print(f"Growth Rate: {response.growth_rate}") + +print("\nKey Players:") +for player in response.key_players: + print(f"\n{player.name}") + print(f"Market Share: {player.market_share}") + print("Key Products:") + for product in player.key_products: + print(f"- {product}") + +print("\nMarket Trends:") +for trend in response.trends: + print(f"\n{trend.name}") + print(f"Impact: {trend.impact}") + print(f"Timeframe: {trend.timeframe}") +``` + + ### Markdownify Convert any webpage into clean, formatted markdown: diff --git a/services/additional-parameters/headers.mdx b/services/additional-parameters/headers.mdx new file mode 100644 index 0000000..6b427f7 --- /dev/null +++ b/services/additional-parameters/headers.mdx @@ -0,0 +1,231 @@ +--- +title: 'Headers & Cookies' +description: 'Customize request headers and cookies for web scraping' +icon: 'gear' +--- + + + Headers Configuration + + +## Overview + +All our services (SmartScraper, SearchScraper, and Markdownify) support custom headers and cookies to help you: +- Bypass basic anti-bot protections +- Access authenticated content +- Maintain sessions +- Customize request behavior + +## Headers + +### Common Headers + +You can set 
any of the following headers in your requests: + +```json +{ + "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", // Browser identification + "Accept": "*/*", // Accepted content types + "Accept-Encoding": "gzip, deflate, br", // Supported encodings + "Accept-Language": "en-US,en;q=0.9", // Preferred languages + "Cache-Control": "no-cache,no-cache", // Caching behavior + "Sec-Ch-Ua": "\"Google Chrome\";v=\"107\", \"Chromium\";v=\"107\"", // Browser details + "Sec-Ch-Ua-Mobile": "?0", // Mobile browser flag + "Sec-Ch-Ua-Platform": "\"macOS\"", // Operating system + "Sec-Fetch-Dest": "document", // Request destination + "Sec-Fetch-Mode": "navigate", // Request mode + "Sec-Fetch-Site": "none", // Request origin + "Sec-Fetch-User": "?1", // User-initiated flag + "Upgrade-Insecure-Requests": "1" // HTTPS upgrade +} +``` + +### Usage Examples + + + +```python Python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +# Define custom headers +headers = { + "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", + "Accept-Language": "en-US,en;q=0.9", + "Sec-Ch-Ua-Platform": "\"Windows\"" +} + +# Use with SmartScraper +response = client.smartscraper( + website_url="https://example.com", + user_prompt="Extract the main content", + headers=headers +) + +# Use with SearchScraper +response = client.searchscraper( + user_prompt="Find information about...", + headers=headers +) + +# Use with Markdownify +response = client.markdownify( + website_url="https://example.com", + headers=headers +) +``` + +```typescript TypeScript +import { Client } from '@scrapegraph/sdk'; + +const client = new Client('your-api-key'); + +// Define custom headers +const headers = { + 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36', + 'Accept-Language': 'en-US,en;q=0.9', + 'Sec-Ch-Ua-Platform': '"Windows"' +}; + +// Use with SmartScraper +const response = await client.smartscraper({ + websiteUrl: 'https://example.com', + userPrompt: 'Extract the main content', + headers: headers +}); +``` + + + +## Cookies + +### Overview + +Cookies are essential for: +- Accessing authenticated content +- Maintaining user sessions +- Handling website preferences +- Bypassing certain security measures + +### Setting Cookies + +Cookies are set using the `Cookie` header as a semicolon-separated string of key-value pairs: + +```python +headers = { + "Cookie": "session_id=abc123; user_id=12345; theme=dark" +} +``` + +### Examples + + + +```python Python +# Example with session cookies +headers = { + "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", + "Cookie": "session_id=abc123; user_id=12345; theme=dark" +} + +response = client.smartscraper( + website_url="https://example.com/dashboard", + user_prompt="Extract user information", + headers=headers +) +``` + +```typescript TypeScript +// Example with session cookies +const headers = { + 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36', + 'Cookie': 'session_id=abc123; user_id=12345; theme=dark' +}; + +const response = await client.smartscraper({ + websiteUrl: 'https://example.com/dashboard', + userPrompt: 'Extract user information', + headers: headers +}); +``` + + + +### Common Use Cases + +1. **Authentication** +```python +headers = { + "Cookie": "auth_token=xyz789; session_id=abc123" +} +``` + +2. **Regional Settings** +```python +headers = { + "Cookie": "country=US; language=en; currency=USD" +} +``` + +3. 
**User Preferences** +```python +headers = { + "Cookie": "theme=dark; notifications=enabled" +} +``` + +## Best Practices + +1. **User Agent Best Practices** + - Use recent browser versions + - Match User-Agent with Sec-Ch-Ua headers + - Consider region-specific variations + +2. **Cookie Management** + - Keep cookies up to date + - Include all required session cookies + - Remove unnecessary cookies + - Handle cookie expiration + +3. **Security Considerations** + - Don't share sensitive cookies + - Rotate User-Agents when appropriate + - Use HTTPS when sending sensitive data + +## Common Issues + + +Cookies may expire during scraping. Solutions: +- Implement cookie refresh logic +- Monitor session status +- Handle re-authentication + + + +Some headers may conflict. Common fixes: +- Remove conflicting headers +- Ensure header values match +- Check case sensitivity + + +## Support + + + + Comprehensive guides and tutorials + + + Detailed API documentation + + + Join our Discord community + + + Check out our open-source projects + + + + + Contact our support team for assistance with headers, cookies, or any other questions! + diff --git a/services/images/searchscraper-banner.png b/services/images/searchscraper-banner.png new file mode 100644 index 0000000..4449aa3 Binary files /dev/null and b/services/images/searchscraper-banner.png differ diff --git a/services/localscraper.mdx b/services/localscraper.mdx deleted file mode 100644 index a8ce92a..0000000 --- a/services/localscraper.mdx +++ /dev/null @@ -1,260 +0,0 @@ ---- -title: 'LocalScraper' -description: 'AI-powered extraction from local HTML content' -icon: 'file-code' ---- - - - LocalScraper Service - - -## Overview - -LocalScraper brings the same powerful AI extraction capabilities as SmartScraper but works with your local HTML content. This makes it perfect for scenarios where you already have the HTML content or need to process cached pages, internal documents, or dynamically generated content. - - -Try LocalScraper instantly in our [interactive playground](https://dashboard.scrapegraphai.com/) - no coding required! - - -## Key Features - - - - Process HTML content directly without making external requests - - - Same powerful AI extraction as SmartScraper - - - No network latency or website loading delays - - - Complete control over your HTML input and processing - - - -## Use Cases - -### Internal Systems -- Process internally cached pages -- Extract from intranet content -- Handle dynamic JavaScript renders -- Process email templates - -### Batch Processing -- Archive data extraction -- Historical content analysis -- Bulk document processing -- Offline content processing - -### Development & Testing -- Test extraction logic locally -- Debug content processing -- Prototype without API calls -- Validate schemas offline - - -Want to learn more about our AI-powered scraping technology? Visit our [main website](https://scrapegraphai.com) to discover how we're revolutionizing web data extraction. - - -## Getting Started - -### Quick Start - -```python -from scrapegraph_py import Client - -client = Client(api_key="your-api-key") - -html_content = """ - - -

-<html>
-  <body>
-    <h1>ScrapeGraphAI</h1>
-    <p>AI-powered web scraping for modern applications.</p>
-    <ul>
-      <li>Smart Extraction</li>
-      <li>Local Processing</li>
-      <li>Schema Support</li>
-    </ul>
-  </body>
-</html>
- - -""" - -response = client.localscraper( - website_html=html_content, - user_prompt="Extract the company information and features" -) -``` - - -Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) - - - -```json -{ - "request_id": "sg-req-xyz789", - "status": "completed", - "user_prompt": "Extract the company information and features", - "result": { - "company_name": "ScrapeGraphAI", - "description": "AI-powered web scraping for modern applications.", - "features": [ - "Smart Extraction", - "Local Processing", - "Schema Support" - ] - }, - "error": "" -} -``` - -The response includes: -- `request_id`: Unique identifier for tracking your request -- `status`: Current status of the extraction -- `result`: The extracted data in structured JSON format -- `error`: Error message (if any occurred during extraction) - - -## Advanced Usage - -### Custom Schema Example - -Define exactly what data you want to extract: - - - -```python Python -from pydantic import BaseModel, Field -from typing import List - -class ProductData(BaseModel): - name: str = Field(description="Product name") - price: str = Field(description="Product price") - description: str = Field(description="Product description") - specifications: List[str] = Field(description="Product specifications") - -response = client.localscraper( - website_html=html_content, - user_prompt="Extract the product information", - output_schema=ProductData -) -``` - -```typescript TypeScript -import { z } from 'zod'; - -const ProductSchema = z.object({ - name: z.string().describe('Product name'), - price: z.string().describe('Product price'), - description: z.string().describe('Product description'), - specifications: z.array(z.string()).describe('Product specifications') -}); - -const response = await localScraper( - apiKey, - html_content, - 'Extract the product information', - ProductSchema -); -``` - - - -### Async Support - -For applications requiring asynchronous execution, LocalScraper provides async support through the `AsyncClient`: - -```python -from scrapegraph_py import AsyncClient -import asyncio - -async def main(): - html_content = """ - - -

-    <html>
-      <body>
-        <h1>Product: Gaming Laptop</h1>
-        <p>$999.99</p>
-        <p>High-performance gaming laptop with RTX 3080.</p>
-      </body>
-    </html>
- - - """ - - async with AsyncClient(api_key="your-api-key") as client: - response = await client.localscraper( - website_html=html_content, - user_prompt="Extract the product information" - ) - print(response) - -# Run the async function -asyncio.run(main()) -``` - -## Integration Options - -### Official SDKs -- [Python SDK](/sdks/python) - Perfect for data science and backend applications -- [JavaScript SDK](/sdks/javascript) - Ideal for web applications and Node.js - -### AI Framework Integrations -- [LangChain Integration](/integrations/langchain) - Use LocalScraper in your LLM workflows -- [LlamaIndex Integration](/integrations/llamaindex) - Build powerful search and QA systems - -## Best Practices - -### HTML Preparation -1. Ensure HTML is well-formed -2. Include relevant content only -3. Clean up unnecessary markup -4. Handle character encoding properly - -### Optimization Tips -- Remove unnecessary scripts and styles -- Clean up dynamic content placeholders -- Preserve important semantic structure -- Include relevant metadata - -## Example Projects - -Check out our [cookbook](/cookbook/introduction) for real-world examples: -- Dynamic content extraction -- Email template processing -- Cached content analysis -- Batch HTML processing - -## API Reference - -For detailed API documentation, see: -- [Start Scraping Job](/api-reference/endpoint/localscraper/start) -- [Get Job Status](/api-reference/endpoint/localscraper/get-status) - -## Support & Resources - - - - Comprehensive guides and tutorials - - - Detailed API documentation - - - Join our Discord community - - - Check out our open-source projects - - - Visit our official website - - - - - Sign up now and get your API key to begin processing your HTML content with LocalScraper! - diff --git a/services/markdownify.mdx b/services/markdownify.mdx index 97e15dc..1dc4561 100644 --- a/services/markdownify.mdx +++ b/services/markdownify.mdx @@ -16,6 +16,42 @@ Markdownify is our specialized service that transforms web content into clean, w Try Markdownify instantly in our [interactive playground](https://dashboard.scrapegraphai.com/) - no coding required! +## Getting Started + +### Quick Start + +```python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +response = client.markdownify( + website_url="https://example.com/article" +) +``` + + +Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) + + + +```json +{ + "request_id": "sg-req-md456", + "status": "completed", + "website_url": "https://example.com/article", + "result": "# Understanding AI-Powered Web Scraping\n\nWeb scraping has evolved significantly with the advent of AI technologies...\n\n## Key Benefits\n\n- Improved accuracy\n- Intelligent extraction\n- Structured output\n\n![AI Scraping Process](https://example.com/images/ai-scraping.png)\n\n> AI-powered scraping represents the future of web data extraction.\n\n### Getting Started\n\n1. Choose your target website\n2. Define extraction goals\n3. Select appropriate tools\n", + "error": "" +} +``` + +The response includes: +- `request_id`: Unique identifier for tracking your request +- `status`: Current status of the conversion +- `result`: The converted markdown content as a single string +- `error`: Error message (if any occurred during conversion) + + ## Key Features @@ -57,41 +93,7 @@ Try Markdownify instantly in our [interactive playground](https://dashboard.scra Want to learn more about our AI-powered scraping technology? 
Visit our [main website](https://scrapegraphai.com) to discover how we're revolutionizing web data extraction. -## Getting Started - -### Quick Start - -```python -from scrapegraph_py import Client - -client = Client(api_key="your-api-key") - -response = client.markdownify( - website_url="https://example.com/article" -) -``` - - -Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) - - - -```json -{ - "request_id": "sg-req-md456", - "status": "completed", - "website_url": "https://example.com/article", - "result": "# Understanding AI-Powered Web Scraping\n\nWeb scraping has evolved significantly with the advent of AI technologies...\n\n## Key Benefits\n\n- Improved accuracy\n- Intelligent extraction\n- Structured output\n\n![AI Scraping Process](https://example.com/images/ai-scraping.png)\n\n> AI-powered scraping represents the future of web data extraction.\n\n### Getting Started\n\n1. Choose your target website\n2. Define extraction goals\n3. Select appropriate tools\n", - "error": "" -} -``` - -The response includes: -- `request_id`: Unique identifier for tracking your request -- `status`: Current status of the conversion -- `result`: Object containing the markdown content and metadata -- `error`: Error message (if any occurred during conversion) - +## Advanced Usage ### Async Support diff --git a/services/searchscraper.mdx b/services/searchscraper.mdx new file mode 100644 index 0000000..1f3d73d --- /dev/null +++ b/services/searchscraper.mdx @@ -0,0 +1,309 @@ +--- +title: 'SearchScraper' +description: 'Search and extract information from multiple web sources using AI' +icon: 'magnifying-glass' +--- + + + SearchScraper Service + + +## Overview + +SearchScraper is our advanced LLM-powered search service that intelligently searches and aggregates information from multiple web sources. Using state-of-the-art language models, it understands your queries and extracts relevant information across the web, providing comprehensive answers with full source attribution. + + +Try SearchScraper instantly in our [interactive playground](https://dashboard.scrapegraphai.com/) - no coding required! 
+ + +## Getting Started + +### Quick Start + +```python +from scrapegraph_py import Client + +client = Client(api_key="your-api-key") + +# Execute a search request +response = client.searchscraper( + user_prompt="What are the key features and pricing of ChatGPT Plus?", +) +``` + + +Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) + + + +```json +{ + "request_id": "sg-req-abc123", + "status": "completed", + "user_prompt": "What are the key features and pricing of ChatGPT Plus?", + "result": { + "product": { + "name": "ChatGPT Plus", + "description": "Premium version of ChatGPT with advanced features and capabilities", + "target_audience": "Power users and professionals requiring advanced AI capabilities" + }, + "features": [ + { + "name": "GPT-4 Access", + "description": "Access to the latest GPT-4 language model" + }, + { + "name": "Response Speed", + "description": "Faster response times compared to free tier" + }, + { + "name": "Priority Access", + "description": "Guaranteed access during peak usage times" + }, + { + "name": "New Features", + "description": "Early access to new features and improvements" + }, + { + "name": "Plugin Support", + "description": "Access to third-party plugins and integrations" + } + ], + "pricing": { + "plans": [ + { + "name": "Plus Subscription", + "price": { + "amount": 20, + "currency": "USD", + "period": "monthly" + }, + "features": [ + "GPT-4 access", + "Faster response speed", + "Priority access during peak times", + "Early feature access", + "Plugin support" + ] + } + ] + }, + "availability": { + "regions": [ + "United States", + "European Union", + "United Kingdom", + "Most other countries" + ], + "restrictions": [ + "Not available in sanctioned countries", + "Requires credit card for subscription" + ] + } + }, + "reference_urls": [ + "https://openai.com/chatgpt", + "https://openai.com/blog/chatgpt-plus", + "https://help.openai.com/en/articles/6825453-chatgpt-plus-plan" + ], + "error": "" +} +``` + +The response includes: +- `request_id`: Unique identifier for tracking your request +- `status`: Current status of the search ("queued", "processing", "completed", "failed") +- `result`: The extracted data in structured JSON format +- `reference_urls`: Source URLs for verification +- `error`: Error message (if any occurred during search) + + +## Key Features + + + + Intelligent search across multiple reliable web sources + + + Advanced LLM models for accurate information extraction + + + Clean, structured data in your preferred format + + + Full transparency with reference URLs + + + +## Use Cases + +### Research & Analysis +- Academic research and fact-finding +- Market research and competitive analysis +- Technology trend analysis +- Industry insights gathering + +### Data Aggregation +- Product research and comparison +- Company information compilation +- Price monitoring across sources +- Technology stack analysis + +### Content Creation +- Fact verification and citation +- Content research and inspiration +- Data-driven article writing +- Knowledge base building + + +Want to learn more about our AI-powered search technology? Visit our [main website](https://scrapegraphai.com) to discover how we're revolutionizing web research. 
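+For the fact-verification and citation use cases above, every answer ships with its sources in `reference_urls`. Below is a minimal sketch of pairing the extracted data with numbered citations — the dict-style access mirrors the example response above; adjust it if your SDK version returns response objects instead of plain dicts:
+
+```python
+from scrapegraph_py import Client
+
+client = Client(api_key="your-api-key")
+
+response = client.searchscraper(
+    user_prompt="What are the key features and pricing of ChatGPT Plus?",
+)
+
+# The structured answer and its sources arrive together.
+print(response["result"])
+
+# Print numbered citations for fact verification.
+for i, url in enumerate(response["reference_urls"], start=1):
+    print(f"[{i}] {url}")
+```
+
+Keeping the extracted data and its citation list together makes downstream fact-checking straightforward.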
+ + +## Advanced Usage + +### Custom Schema Example + +Define exactly what data you want to extract using Pydantic or Zod: + + + +```python Python +from pydantic import BaseModel, Field +from typing import List + +class CompanyProfile(BaseModel): + name: str = Field(description="Company name") + description: str = Field(description="Brief company description") + founded_year: str = Field(description="Year the company was founded") + headquarters: str = Field(description="Company headquarters location") + employees: str = Field(description="Number of employees") + industry: str = Field(description="Primary industry") + products: List[str] = Field(description="Main products or services") + competitors: List[str] = Field(description="Major competitors") + market_share: str = Field(description="Company's market share") + revenue: str = Field(description="Annual revenue") + tech_stack: List[str] = Field(description="Technologies used by the company") + +response = client.searchscraper( + user_prompt="Find comprehensive information about OpenAI", + output_schema=CompanyProfile +) +``` + +```typescript TypeScript +import { z } from 'zod'; + +const CompanyProfile = z.object({ + name: z.string().describe('Company name'), + description: z.string().describe('Brief company description'), + foundedYear: z.string().describe('Year the company was founded'), + headquarters: z.string().describe('Company headquarters location'), + employees: z.string().describe('Number of employees'), + industry: z.string().describe('Primary industry'), + products: z.array(z.string()).describe('Main products or services'), + competitors: z.array(z.string()).describe('Major competitors'), + marketShare: z.string().describe('Company\'s market share'), + revenue: z.string().describe('Annual revenue'), + techStack: z.array(z.string()).describe('Technologies used by the company') +}); + +const response = await client.searchscraper({ + userPrompt: 'Find comprehensive information about OpenAI', + outputSchema: CompanyProfile +}); +``` + + + +### Async Support + +For applications requiring asynchronous execution: + +```python +from scrapegraph_py import AsyncClient +import asyncio + +async def main(): + async with AsyncClient(api_key="your-api-key") as client: + + response = await client.searchscraper( + user_prompt="Analyze the current AI chip market", + ) + + # Process the structured results (the keys below assume an output schema like the one above) + market_data = response.result + print(f"Market Size: {market_data['market_overview']['total_size']}") + print(f"Growth Rate: {market_data['market_overview']['growth_rate']}") + print("\nKey Players:") + for player in market_data['market_overview']['key_players']: + print(f"- {player}") + +# Run the async function +asyncio.run(main()) +``` + +## Integration Options + +### Official SDKs +- [Python SDK](/sdks/python) - Perfect for data science and backend applications +- [JavaScript SDK](/sdks/javascript) - Ideal for web applications and Node.js + +### AI Framework Integrations +- [LangChain Integration](/integrations/langchain) - Use SearchScraper in your LLM workflows +- [LlamaIndex Integration](/integrations/llamaindex) - Build powerful search and QA systems +- [CrewAI Integration](/integrations/crewai) - Create AI agents with search capabilities + +## Best Practices + +### Query Optimization +1. Be specific in your prompts +2. Use descriptive queries +3. Include relevant context +4. 
Specify time-sensitive requirements + +### Schema Design +- Start with essential fields +- Use appropriate data types +- Add field descriptions +- Make optional fields nullable +- Group related information + +### Rate Limiting +- Implement reasonable delays between requests +- Use async clients for better performance +- Monitor your [API usage](/dashboard/overview) + +## Example Projects + +Check out our [cookbook](/cookbook/introduction) for real-world examples: +- [Company Research](/cookbook/examples/company-info) +- [Market Analysis](/cookbook/examples/research-agent) +- [Technology Trends](/cookbook/examples/github-trending) +- [News Aggregation](/cookbook/examples/wired) + +## API Reference + +For detailed API documentation, see: +- [Start Search](/api-reference/endpoint/searchscraper/start) +- [Get Search Status](/api-reference/endpoint/searchscraper/get-status) + +## Support & Resources + + + + Comprehensive guides and tutorials + + + Detailed API documentation + + + Join our Discord community + + + Check out our open-source projects + + + + + Sign up now and get your API key to begin searching and extracting data with SearchScraper! + \ No newline at end of file diff --git a/services/smartscraper.mdx b/services/smartscraper.mdx index 70600bb..226e86b 100644 --- a/services/smartscraper.mdx +++ b/services/smartscraper.mdx @@ -16,47 +16,6 @@ SmartScraper is our flagship LLM-powered web scraping service that intelligently Try SmartScraper instantly in our [interactive playground](https://dashboard.scrapegraphai.com/) - no coding required! -## Key Features - - - - Works with any website structure, including JavaScript-rendered content - - - Contextual understanding of content for accurate extraction - - - Returns clean, structured data in your preferred format - - - Define custom output schemas using Pydantic or Zod - - - -## Use Cases - -### Content Aggregation -- News article extraction -- Blog post summarization -- Product information gathering -- Research data collection - -### Data Analysis -- Market research -- Competitor analysis -- Price monitoring -- Trend tracking - -### AI Training -- Dataset creation -- Training data collection -- Content classification -- Knowledge base building - - -Want to learn more about our AI-powered scraping technology? Visit our [main website](https://scrapegraphai.com) to discover how we're revolutionizing web data extraction. - - ## Getting Started ### Quick Start @@ -76,6 +35,7 @@ response = client.smartscraper( Get your API key from the [dashboard](https://dashboard.scrapegraphai.com) + ```json { @@ -109,6 +69,87 @@ The response includes: - `error`: Error message (if any occurred during extraction) + +Instead of providing a URL, you can optionally pass your own HTML content: + +```python +html_content = """ + + +

+<html>
+  <body>
+    <h1>ScrapeGraphAI</h1>
+    <p>AI-powered web scraping for modern applications.</p>
+    <ul>
+      <li>Smart Extraction</li>
+      <li>Local Processing</li>
+      <li>Schema Support</li>
+    </ul>
+  </body>
+</html>
+ + +""" + +response = client.smartscraper( + website_html=html_content, # This will override website_url if both are provided + user_prompt="Extract info about the company" +) +``` + +This is useful when: +- You already have the HTML content cached +- You want to process modified HTML +- You're working with dynamically generated content +- You need to process content offline +- You want to pre-process the HTML before extraction + + +When both `website_url` and `website_html` are provided, `website_html` takes precedence and will be used for extraction. + +
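+The same parameter also covers the batch and offline scenarios above — for example, a folder of locally cached pages. A minimal sketch (the `cached_pages/` directory and the prompt are illustrative placeholders):
+
+```python
+from pathlib import Path
+
+from scrapegraph_py import Client
+
+client = Client(api_key="your-api-key")
+
+# Extract from locally cached HTML files without fetching any URLs.
+for page in Path("cached_pages").glob("*.html"):
+    response = client.smartscraper(
+        website_html=page.read_text(encoding="utf-8"),
+        user_prompt="Extract info about the company",
+    )
+    print(page.name, response["result"])
+```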
+ +## Key Features + + + + Works with any website structure, including JavaScript-rendered content + + + Contextual understanding of content for accurate extraction + + + Returns clean, structured data in your preferred format + + + Define custom output schemas using Pydantic or Zod + + + +## Use Cases + +### Content Aggregation +- News article extraction +- Blog post summarization +- Product information gathering +- Research data collection + +### Data Analysis +- Market research +- Competitor analysis +- Price monitoring +- Trend tracking + +### AI Training +- Dataset creation +- Training data collection +- Content classification +- Knowledge base building + + +Want to learn more about our AI-powered scraping technology? Visit our [main website](https://scrapegraphai.com) to discover how we're revolutionizing web data extraction. + + ## Advanced Usage ### Custom Schema Example