WebCrawlerConfiguration
Provides the configuration information required for Amazon Kendra Web Crawler.
Types
Properties
Configuration information required to connect to websites using authentication.
The crawl depth: the number of levels from the seed level to crawl. For example, the seed URL page is depth 1, and any hyperlinks on that page that are also crawled are depth 2.
The maximum size (in MB) of a web page or attachment to crawl.
The maximum number of URLs on a single web page to include when crawling a website. The limit applies to each web page separately.
The maximum number of URLs crawled per website host per minute.
Configuration information required to connect to your internal websites via a web proxy.
A list of regular expression patterns for URLs to exclude from the crawl. URLs that match the patterns are excluded from the index. URLs that don't match the patterns are included in the index. If a URL matches both an inclusion and an exclusion pattern, the exclusion pattern takes precedence and the URL isn't included in the index.
A list of regular expression patterns for URLs to include in the crawl. URLs that match the patterns are included in the index. URLs that don't match the patterns are excluded from the index. If a URL matches both an inclusion and an exclusion pattern, the exclusion pattern takes precedence and the URL isn't included in the index.
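The precedence rule for inclusion and exclusion patterns can be illustrated with a minimal local sketch. This is not the crawler's implementation: the function name `is_indexed` is hypothetical, and the sketch assumes that a pattern may match anywhere in the URL (`re.search`) and that an empty inclusion list means no inclusion filtering, neither of which is stated above.

```python
import re


def is_indexed(url, inclusion_patterns, exclusion_patterns):
    """Return True if `url` would be kept under the precedence rule.

    Hypothetical helper for illustration only; the matching semantics
    (re.search, empty inclusion list = include everything) are assumptions.
    """
    # Exclusion is checked first, so it wins when both kinds of pattern match.
    if any(re.search(pattern, url) for pattern in exclusion_patterns):
        return False
    # With inclusion patterns present, the URL must match at least one.
    if inclusion_patterns:
        return any(re.search(pattern, url) for pattern in inclusion_patterns)
    # Assumption: no inclusion patterns means nothing is filtered out here.
    return True


include = [r"/docs/"]
exclude = [r"\.pdf$"]

for candidate in (
    "https://example.com/docs/guide.html",  # matches inclusion only
    "https://example.com/docs/guide.pdf",   # matches both, exclusion wins
    "https://example.com/blog/post.html",   # matches neither
):
    print(candidate, is_indexed(candidate, include, exclude))
```

Checking exclusion before inclusion keeps the rule in one place: a URL that matches both lists never reaches the inclusion test.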