HeadlessVidX

How to Train and Add Websites

HeadlessVidX can extract video streaming links from many websites, but not every site will be compatible or trainable. Many websites won't require a site definition to extract the video. The easiest sites are those where the video starts playing as soon as the page loads, and these usually don't need any training.

However, if you don't see a video start playing immediately, it doesn't necessarily mean HeadlessVidX needs training. Many video players use preloading, which begins loading the video in the browser before the user clicks the play button. Preloading can make it appear as though the video isn't playing when it is actually being prepared for playback in the background. This process simplifies the task of intercepting the video’s location since the video data is already being loaded by the browser. As a result, HeadlessVidX can more easily detect and return the streaming link, allowing for seamless extraction and playback without needing to click a button.

Furthermore, while HeadlessVidX can handle a vast array of websites, some will inevitably employ advanced anti-scraping measures or sophisticated encryption methods. In these cases, manual intervention and training might be required to successfully extract the video streams.

To optimize HeadlessVidX for your specific needs, you can configure various settings. These settings allow you to fine-tune the behavior and capabilities of HeadlessVidX, ensuring it performs effectively across different scenarios. Let’s take a look at each setting and how it can enhance your video extraction process:

Site Profile

After running a test, the 'Response Data' will display the results. If the test is successful and a video is found, an 'Add Site' button will appear. If the test fails, the button will not be shown, and you will need to adjust the settings and try again. Once a site has been added, it will be listed in the 'Load Profile' section. Adding a site that already exists does not create a duplicate; the existing entry is overwritten with the new one.

Video Page URL

The URL of the page that the video is located on. If the video has to be clicked, or a page navigation needs to take place before getting to the video page, use the CSS Click Selectors.

Show Browser

Choose whether or not to display the browser during testing. Enabling this option allows you to visually inspect the process. Note that the Show Browser value is not saved when adding the site, and by default, the browser will be hidden.

Page Timeout

The timeout value in seconds for the page operations. This is how long to wait before abandoning the request and returning it as a failed attempt.

Select Browser

Choosing the right browser is an important step in successfully extracting the video URL.

Firefox: Firefox is the most reliable browser because its fingerprinting looks the most human, making it less likely to be detected as a bot. However, Firefox can be slower when running tasks. When running in 'Show Browser' mode, it is even slower, but it will be faster during a production run.

Chrome: Chrome is much faster than Firefox but is more often identified as a bot by anti-bot detection systems. If you notice HeadlessVidX is being blocked by a devtools detection script, try adding a string to blacklist/javascripts-loading.txt that identifies the script so that it is blocked from loading.

WebKit: WebKit may sometimes be identified as an unsupported browser, but it is included as an option for those who might need it.

Video Must Contain

The video URL that needs to be extracted 'must' contain this string or strings. You can use a list of comma-separated strings. Example: If the video I wanted was located at http://example.com/video.mp4, I could use the string 'example'. Any video files found on the page that do not match will be ignored.

Video 'Not' Contain

The returned video URL must 'not' contain this string or strings. You can use a list of comma-separated strings. Example: If an ad video on the video page is located at http://advertise-example.com/video.mp4, I could use 'advertise' to blacklist any video URL containing that string.
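As a rough illustration of how the two filters combine, the sketch below applies a must-contain list and a not-contain list to candidate URLs. The helper names `parseList` and `matchesVideoFilters` are hypothetical, and the sketch assumes a URL passes when it contains at least one must-contain string and none of the not-contain strings; HeadlessVidX's own matching rules may differ.

```javascript
// Hypothetical helper, not part of HeadlessVidX: split a comma-separated
// setting into trimmed, non-empty strings.
function parseList(value) {
  return (value || '')
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Assumption: a URL passes when it contains at least one "must contain"
// string (or that list is empty) and none of the "not contain" strings.
function matchesVideoFilters(url, mustContain, notContain) {
  const must = parseList(mustContain);
  const not = parseList(notContain);
  const hasRequired = must.length === 0 || must.some((s) => url.includes(s));
  const hasBlocked = not.some((s) => url.includes(s));
  return hasRequired && !hasBlocked;
}

const candidates = [
  'http://example.com/video.mp4',
  'http://advertise-example.com/video.mp4',
];
const kept = candidates.filter((u) =>
  matchesVideoFilters(u, 'example', 'advertise')
);
console.log(kept);
```

Note that the ad URL also contains 'example', so in this sketch it is the not-contain entry 'advertise' that excludes it.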

Click Selectors

The Click Selectors provide various methods to target and click elements on a webpage. You can use general CSS selectors like class, ID, or tag selectors (e.g., .btn-play, #main-content), or you can use XPath selectors to target the elements on the page that require a click. Running JavaScript on a page is also possible by using the 'javascript:' scheme.

If you need HeadlessVidX to click on multiple items or navigate through pages, you can use a comma to separate the selectors. Comma-separated selectors are helpful when you need to interact with several elements on the page, such as navigating through different sections or pages, clicking multiple buttons, or performing a series of actions in a specific order.

Some Examples:

CSS:
#play-btn

XPath:
//*[@id='play-btn']

JavaScript:
javascript:document.getElementById('play-btn').click();

Tip: Try the free version of the SelectorsHub plugin for Chrome and Firefox. By clicking on an element, this tool will automatically generate multiple expressions for you.

Using Multiple Selectors

You can pass multiple selectors separated by commas. For example:

.btn-play, #main-content, //button[@class='btn-play'], //div[text()='Welcome']
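To make the three selector kinds concrete, here is a minimal sketch of how a comma-separated list like the one above could be split and classified. The functions `classifySelector` and `parseClickSelectors` are hypothetical and are not HeadlessVidX's actual parser; the rule that anything starting with "/" or "(" is treated as XPath is an assumption made for illustration.

```javascript
// Hypothetical classifier, not HeadlessVidX's actual parser.
function classifySelector(selector) {
  const s = selector.trim();
  if (s.toLowerCase().startsWith('javascript:')) {
    return { type: 'javascript', value: s.slice('javascript:'.length) };
  }
  // Assumption: XPath expressions begin with "/" or "(".
  if (s.startsWith('/') || s.startsWith('(')) {
    return { type: 'xpath', value: s };
  }
  return { type: 'css', value: s };
}

// Naive split on commas: an XPath such as contains(@class, 'x') or a
// javascript: snippet with a comma inside would be broken apart.
function parseClickSelectors(setting) {
  return setting
    .split(',')
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .map(classifySelector);
}

const steps = parseClickSelectors(
  ".btn-play, #main-content, //button[@class='btn-play'], //div[text()='Welcome']"
);
for (const step of steps) {
  console.log(step.type, step.value);
}
```

In this sketch the first two entries are classified as CSS and the last two as XPath, and the order of the list is preserved so the clicks can run in sequence.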

Brute Click

Brute Click is a feature that uses x and y coordinates to click a specific point on a webpage. This can be useful when you need to interact with an element that cannot be easily targeted with a standard selector. However, for Brute Click to work effectively, at least one Click Selector is required. The Click Selector serves as an element to wait for, ensuring that the page has loaded sufficiently before the x and y coordinates are clicked.
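The wait-then-click order described above can be sketched as follows. This is an illustration under stated assumptions, not HeadlessVidX's implementation: `bruteClick` is a hypothetical name, `page` is assumed to be a Puppeteer-style page object exposing `waitForSelector` and `mouse.click`, and the selector and coordinates are made up. A stand-in page object is used so the sketch runs without a browser.

```javascript
// Hypothetical helper. `page` is assumed to be a Puppeteer-style page
// object exposing waitForSelector(selector, options) and mouse.click(x, y).
async function bruteClick(page, waitSelector, x, y, timeoutSeconds = 30) {
  // Wait for the Click Selector first so the page has loaded far enough.
  await page.waitForSelector(waitSelector, { timeout: timeoutSeconds * 1000 });
  // Then click the raw viewport coordinates.
  await page.mouse.click(x, y);
}

// Stand-in page object that records calls instead of driving a browser.
const calls = [];
const fakePage = {
  waitForSelector: async (selector, options) => {
    calls.push(['wait', selector, options.timeout]);
  },
  mouse: {
    click: async (x, y) => {
      calls.push(['click', x, y]);
    },
  },
};

// '#player', 640 and 360 are illustrative values only.
const done = bruteClick(fakePage, '#player', 640, 360).then(() => {
  console.log(calls);
});
```

The timeout is passed in seconds and converted to milliseconds, matching the unit used by the Page Timeout setting.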

Stealth Mode

Enable stealth mode to avoid detection by anti-bot measures. Set this option to true or false. Be aware that some websites can detect the stealth browser, but not the regular browser. If you encounter issues, try switching between stealth mode and the regular browser.

Embed Video

Replace the body content with an iframe containing the specified URL. Set this option to true or false. Enabling this setting is useful if HeadlessVidX needs to interact with the video player within an iframe. This can be particularly helpful for sites that require embedding the video player to function correctly.

Include Referer

The referrer is the URL of the previous web page from which a link to the current page was followed. Some websites use the referrer information to enhance security and prevent direct access from outside sources. Specify the referrer URL to be included in the request headers if needed.

Block Resources

List resource types to block during the process, separated by commas. Puppeteer supports blocking the following resource types: document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, and other. Be careful when blocking 'script' as it may cause the video to not load.
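As an illustration, the sketch below parses a comma-separated Block Resources value into a set and flags entries that are not in the list of resource types given above. `parseBlockedTypes` is a hypothetical helper, and the request-handler wiring shown in the trailing comment is an assumption about how such a set might be used with Puppeteer's request interception.

```javascript
// Resource types listed in the text above.
const KNOWN_TYPES = new Set([
  'document', 'stylesheet', 'image', 'media', 'font', 'script', 'texttrack',
  'xhr', 'fetch', 'eventsource', 'websocket', 'manifest', 'other',
]);

// Hypothetical parser: turns the comma-separated setting into a Set and
// collects any entries that are not recognised resource types.
function parseBlockedTypes(setting) {
  const blocked = new Set();
  const unknown = [];
  for (const raw of setting.split(',')) {
    const type = raw.trim().toLowerCase();
    if (type === '') continue;
    if (KNOWN_TYPES.has(type)) blocked.add(type);
    else unknown.push(type);
  }
  return { blocked, unknown };
}

// 'imgae' is a deliberate typo to show how unknown entries are reported.
const { blocked, unknown } = parseBlockedTypes('image, font, stylesheet, imgae');
console.log([...blocked]);
console.log(unknown);

// Assumed wiring with Puppeteer request interception (not run here):
//   await page.setRequestInterception(true);
//   page.on('request', (req) =>
//     blocked.has(req.resourceType()) ? req.abort() : req.continue());
```

Reporting unknown entries separately makes a mistyped resource type visible instead of silently blocking nothing.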

Block Ads

Enable ad blocking using EasyList. When this option is enabled, all URLs are filtered through the Adblock Plus EasyList, which blocks known ad domains and prevents advertisements from loading. This helps to speed up page load times and reduces interference from ads on the page.

Block JS Contains

Enable blocking of JavaScript content based on a blacklist. When this option is enabled, the process inspects and searches each loading JavaScript file for strings listed in blacklist/javascripts-contains.txt. Be aware that enabling this feature may slow down the process due to the additional inspection required.
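The content check can be pictured with the minimal sketch below. It assumes the blacklist file holds one string per line, which this guide does not confirm, and the helper names `parseBlacklist` and `findBlacklisted` as well as the sample entries are hypothetical.

```javascript
// Assumption: the blacklist file holds one string per line and blank
// lines are ignored. The real file format is not described in this guide.
function parseBlacklist(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Returns the first blacklisted string found in the script body, or null.
function findBlacklisted(scriptSource, blacklist) {
  return blacklist.find((needle) => scriptSource.includes(needle)) ?? null;
}

// Sample blacklist text; 'some-tracker-id' is a made-up entry.
const blacklist = parseBlacklist('disable-devtool\n\nsome-tracker-id\n');
const scriptBody = '/* bundled */ import("disable-devtool").then(run);';

console.log(findBlacklisted(scriptBody, blacklist));
console.log(findBlacklisted('console.log("hello")', blacklist));
```

Because every script body has to be read and searched, a check like this adds work per script, which is consistent with the slowdown mentioned above.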

Custom Useragent

If the website hosting the video blocks HeadlessVidX when it uses random user agents, you can set a custom user agent to try to bypass this issue. Many streaming sites employ 'theajack/disable-devtool', which may cause the browser to close the page immediately. If you encounter this problem, try switching to another browser or setting a custom user agent; a Googlebot user agent has worked in some cases.
