Help:Check prices


Summary: DCCWiki has implemented a system to find and track various DCC-related products from various vendors. Prices are linked from the product details page.

The vendor pricing system is very young and is still under active development.

Overview

The DCCWiki price-checking system was implemented to help users find which vendors (resellers) have an item in stock and at what price. Prices are checked periodically and updated as needed.

Why

This system was developed in response to the difficulty some users had finding parts during the Covid-19 pandemic supply shortage, when many vendors were out of stock on many items. The system collects quantity information directly from vendors' websites, along with pricing and currency information.

Current and Pending Vendors

The Pending Vendors page shows vendors that are in progress. Be sure to check the tab titled 'Won't implement' for a list of vendors that have been reviewed and will not be added.

There is also a list of currently available vendors.

Adding vendors

Main article: Help:Adding vendors

Logged-in users may submit requests to add vendors. See the Adding vendors article for details about requirements.

How it works

A complex matching system ranks the web pages of every active vendor against every product known to DCCWiki. For this, we use stored meta information from all vendors' web pages.

Crawling vendor websites

When a new vendor is added, or when a website needs to be refreshed, we check common URLs for a sitemap file that lists the content to crawl. If none is found, DCCWiki carefully and systematically crawls each page on the vendor's website for product information and additional links to check. The system also checks the vendor's robots.txt file for pages that should not be crawled. Additionally, we maintain a lengthy list of URL strings that are never crawled, such as cart.php or cart.aspx (over 340 strings are currently blocked).
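The two URL checks described above (robots.txt and the internal blocklist) can be sketched in Python as follows. This is a minimal illustration, not the actual DCCWiki code: the names `BLOCKED_SUBSTRINGS`, `build_robots` and `may_crawl`, the `ExampleBot` user agent, and the vendor.example URLs are all hypothetical, and only cart.php and cart.aspx are taken from the text.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical two-entry sample; the real list is said to hold over 340 strings.
BLOCKED_SUBSTRINGS = ("cart.php", "cart.aspx")


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(url: str, robots: RobotFileParser,
              user_agent: str = "ExampleBot") -> bool:
    """Return True only if the URL passes both the blocklist and robots.txt."""
    lowered = url.lower()
    # Internal check: reject any URL containing a blocked string.
    if any(blocked in lowered for blocked in BLOCKED_SUBSTRINGS):
        return False
    # Vendor's wishes: reject anything robots.txt disallows for this agent.
    return robots.can_fetch(user_agent, url)
```

Running the cheap substring check first means that shopping-cart URLs are rejected without consulting the robots.txt rules at all.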

How we crawl

An automated script controls a headless (monitor-less) Chrome/Firefox browser. This allows our service to use various on-demand, dynamically provisioned servers in random, geographically diverse locations.

What we collect

Various tools are used to determine whether the current webpage contains any product information. If so, that information is captured and stored. Additionally, publicly available data sent to us by the web server is harvested, and meta information is created from it. While the webpage is in view, a screenshot is taken, which is used to show end users a rendering of the website. This screenshot is thumbnail-sized and falls under fair use due to its low resolution and small size.
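The text does not say which tools detect product information. As one plausible illustration, many shop pages embed schema.org Product data as JSON-LD, which can be read with the Python standard library alone. The sketch below assumes that markup is present; `JsonLdCollector` and `extract_offer` are hypothetical names, and nothing here is confirmed to be what DCCWiki uses.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Ignore malformed blocks rather than abort the page.


def extract_offer(html: str):
    """Return name, price, currency and availability, or None if absent."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        if isinstance(block, dict) and block.get("@type") == "Product":
            offer = block.get("offers") or {}
            if isinstance(offer, dict):
                return {
                    "name": block.get("name"),
                    "price": offer.get("price"),
                    "currency": offer.get("priceCurrency"),
                    "availability": offer.get("availability"),
                }
    return None
```

A page without such markup simply yields `None`, so a real system would need other detection methods as well.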

Any new links found are automatically added to the crawl queue at low priority, provided the URL is not excluded by the robots.txt file or our internal checks.
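A prioritised crawl queue of this kind can be sketched with a binary heap. The class below is only an illustration under stated assumptions: the name `CrawlQueue`, the numeric priority constants, and the `is_allowed` callback (standing in for the robots.txt and internal checks) are hypothetical and not taken from the DCCWiki implementation.

```python
import heapq
import itertools

# Lower numbers are served first.
HIGH, MEDIUM, LOW = 0, 1, 2


class CrawlQueue:
    """Priority queue of URLs that ignores duplicates and disallowed URLs."""

    def __init__(self, is_allowed=lambda url: True):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # Keeps FIFO order within a priority.
        self._is_allowed = is_allowed

    def add(self, url, priority=LOW):
        """Queue a URL; return False if it was seen before or is not allowed."""
        if url in self._seen or not self._is_allowed(url):
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._counter), url))
        return True

    def pop(self):
        """Remove and return the most urgent URL."""
        _priority, _order, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

Newly discovered links would be added with the default `LOW` priority, so pages already known to hold product information are always served ahead of them.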

How we use meta information

When a new DCC product is added to DCCWiki, all web pages with previously collected meta information are checked for pricing information for that product. If any is found, it is used to create a new product pricing information page automatically. No additional crawling of web pages is required.
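The matching system itself is not described in detail here. As a deliberately naive stand-in, the Python sketch below ranks stored page records against a new product name by word overlap, without fetching anything. The `PageMeta` fields, the `rank_pages` function, the 0.5 threshold, and the "Acme X100" product are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class PageMeta:
    """Hypothetical stored record for one previously crawled vendor page."""
    url: str
    title: str
    price: float
    currency: str


def tokens(text: str) -> set:
    """Split text into lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_pages(product_name: str, pages, min_score: float = 0.5):
    """Return (score, page) pairs, best first, using stored data only."""
    wanted = tokens(product_name)
    scored = []
    for page in pages:
        overlap = len(wanted & tokens(page.title))
        score = overlap / len(wanted) if wanted else 0.0
        if score >= min_score:
            scored.append((score, page))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

Because the function reads only stored records, a new product can be matched immediately, which is the point the paragraph above makes about needing no additional crawling.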

Frequency of crawling

Webpages are re-crawled based on various factors, including:

  1. New web links found but never crawled: added to the queue as low priority.
  2. The webpage contains product information, but no matching DCCWiki product article is found: re-crawled once every 4 to 8 weeks, added to the queue as medium priority.
  3. The webpage contains product information that matches a DCCWiki product article: re-crawled once every 1 to 4 days, added to the queue as high priority.
  4. The page returns an error (404/500): the page is removed from crawling for 1 month. If it returns an error a second time, it is removed permanently and is never crawled again.
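The four rules in the list above can be expressed as a single decision function. The Python sketch below is an illustration only: `Plan`, `plan_recrawl`, and the "paused" label are hypothetical, and "1 month" is approximated as 30 days. Cases the list does not cover raise an error rather than guess.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Plan:
    priority: str
    min_wait: timedelta
    max_wait: timedelta


def plan_recrawl(crawled_before: bool, has_product_info: bool,
                 matches_article: bool, error_count: int = 0) -> Optional[Plan]:
    """Return the re-crawl plan for a page, or None if it is banned for good."""
    if error_count >= 2:
        return None  # Rule 4, second error: never crawl again.
    if error_count == 1:
        month = timedelta(days=30)  # Assumption: "1 month" taken as 30 days.
        return Plan("paused", month, month)
    if not crawled_before:
        return Plan("low", timedelta(0), timedelta(0))  # Rule 1
    if has_product_info and matches_article:
        return Plan("high", timedelta(days=1), timedelta(days=4))  # Rule 3
    if has_product_info:
        return Plan("medium", timedelta(weeks=4), timedelta(weeks=8))  # Rule 2
    raise ValueError("case not covered by the listed rules")
```

For example, `plan_recrawl(True, True, True)` yields a high-priority plan with a wait of between 1 and 4 days.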

Regardless of priority, we strive to crawl at a rate of no more than 1 to 6 pages per minute. Web servers are designed to serve hundreds, thousands, and sometimes tens of thousands of page views per minute. Our crawl rate is slow enough that a vendor's web server(s) should not notice this process.
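A rate of 1 to 6 pages per minute corresponds to a gap of 60 to 10 seconds between requests. One way to enforce such a gap is sketched below in Python. The `Throttle` class and its parameters are hypothetical, and choosing a random gap within the range is an assumption rather than something the text states.

```python
import random
import time


class Throttle:
    """Keep successive requests between 60/max and 60/min seconds apart."""

    def __init__(self, min_per_minute=1, max_per_minute=6,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_gap = 60.0 / max_per_minute  # 10 s at 6 pages per minute
        self.max_gap = 60.0 / min_per_minute  # 60 s at 1 page per minute
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until enough time has passed since the previous request."""
        gap = random.uniform(self.min_gap, self.max_gap)
        if self._last is not None:
            remaining = gap - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

Calling `wait()` before each page fetch keeps the crawler within the stated range. The injectable `clock` and `sleep` arguments exist only so the behaviour can be checked without real delays.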

See Also