Can a website detect when you are using Selenium with chromedriver?

Может ли веб-сайт определить, когда вы используете Selenium с chromedriver?

Я тестировал Selenium с Chromedriver и заметил, что некоторые страницы могут определить, что вы используете Selenium, даже если там вообще нет автоматизации. Даже когда я просто просматриваю вручную, просто используя Chrome через Selenium и Xephyr, я часто получаю страницу с сообщением об обнаружении подозрительной активности. Я проверил свой пользовательский агент и отпечаток пальца в браузере, и все они в точности идентичны обычному браузеру Chrome.

Когда я захожу на эти сайты в обычном Chrome, все работает нормально, но в тот момент, когда я использую Selenium, я обнаружен.

Теоретически chromedriver и Chrome должны выглядеть буквально одинаково на любом веб-сервере, но каким-то образом они могут это обнаружить.

Если вам нужен тестовый код, попробуйте это:

from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=1, size=(1600, 902))
display.start()
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--disable-extensions')
chrome_options.add_argument('--profile-directory=Default')
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--disable-plugins-discovery");
chrome_options.add_argument("--start-maximized")
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.delete_all_cookies()
driver.set_window_size(800,800)
driver.set_window_position(0,0)
print 'arguments done'
driver.get('http://stubhub.com')

Если вы просматриваете stubhub, вы будете перенаправлены и "заблокированы" в течение одного или двух запросов. Я расследовал это и не могу понять, как они могут определить, что пользователь использует Selenium.

Как они это делают?

Я установил плагин Selenium IDE в Firefox, и меня забанили, когда я зашел на stubhub.com в обычном браузере Firefox только с дополнительным плагином.

Когда я использую Fiddler для просмотра HTTP-запросов, отправляемых туда и обратно, я заметил, что запросы "поддельного браузера" часто имеют "no-cache" в заголовке ответа.

Такие результаты Есть ли способ определить, что я нахожусь на странице Selenium Webdriver с помощью JavaScript? предполагаю, что не должно быть способа определить, когда вы используете webdriver. Но эти данные свидетельствуют об обратном.

Сайт загружает отпечаток пальца на свои серверы, но я проверил, и отпечаток пальца Selenium идентичен отпечатку пальца при использовании Chrome.

Это одна из полезных загрузок отпечатков пальцев, которые они отправляют на свои серверы:

{"appName":"Netscape","platform":"Linuxx86_64","cookies":1,"syslang":"en-US","userlang":"en-
US","cpu":"","productSub":"20030107","setTimeout":1,"setInterval":1,"plugins":
{"0":"ChromePDFViewer","1":"ShockwaveFlash","2":"WidevineContentDecryptionMo
dule","3":"NativeClient","4":"ChromePDFViewer"},"mimeTypes":
{"0":"application/pdf","1":"ShockwaveFlashapplication/x-shockwave-
flash","2":"FutureSplashPlayerapplication/futuresplash","3":"WidevineContent
DecryptionModuleapplication/x-ppapi-widevine-
cdm","4":"NativeClientExecutableapplication/x-
nacl","5":"PortableNativeClientExecutableapplication/x-
pnacl","6":"PortableDocumentFormatapplication/x-google-chrome-
pdf"},"screen":{"width":1600,"height":900,"colorDepth":24},"fonts":
{"0":"monospace","1":"DejaVuSerif","2":"Georgia","3":"DejaVuSans","4":"Trebu
chetMS","5":"Verdana","6":"AndaleMono","7":"DejaVuSansMono","8":"LiberationM
ono","9":"NimbusMonoL","10":"CourierNew","11":"Courier"}}

Он идентичен в Selenium и в Chrome.

VPN работают для одноразового использования, но они обнаруживаются после загрузки первой страницы. Очевидно, что для обнаружения Selenium выполняется какой-то код JavaScript.

Переведено автоматически

Ответ 1

По сути, способ, которым работает обнаружение Selenium, заключается в том, что они тестируют предопределенные переменные JavaScript, которые появляются при запуске с Selenium. Скрипты обнаружения ботов обычно просматривают все, что содержит слово "selenium" / "webdriver" в любой из переменных (в объекте window), а также переменные документа с именами $cdc_ и $wdc_. Конечно, все это зависит от того, в каком браузере вы используете. Все разные браузеры отображают разные вещи.

For me, I used Chrome, so, all that I had to do was to ensure that $cdc_ didn't exist anymore as a document variable, and voilà (download chromedriver source code, modify chromedriver and re-compile $cdc_ under different name.)

This is the function I modified in chromedriver:

File call_function.js:

function getPageCache(opt_doc) {
  var doc = opt_doc || document;
  //var key = '$cdc_asdjflasutopfhvcZLmcfl_';
  var key = 'randomblabla_';
  if (!(key in doc))
    doc[key] = new Cache();
  return doc[key];
}

(Note the comment. All I did I turned $cdc_ to randomblabla_.)

Here is pseudocode which demonstrates some of the techniques that bot networks might use:

runBotDetection = function () {
    var documentDetectionKeys = [
        "__webdriver_evaluate",
        "__selenium_evaluate",
        "__webdriver_script_function",
        "__webdriver_script_func",
        "__webdriver_script_fn",
        "__fxdriver_evaluate",
        "__driver_unwrapped",
        "__webdriver_unwrapped",
        "__driver_evaluate",
        "__selenium_unwrapped",
        "__fxdriver_unwrapped",
    ];

    var windowDetectionKeys = [
        "_phantom",
        "__nightmare",
        "_selenium",
        "callPhantom",
        "callSelenium",
        "_Selenium_IDE_Recorder",
    ];

    for (const windowDetectionKey in windowDetectionKeys) {
        const windowDetectionKeyValue = windowDetectionKeys[windowDetectionKey];
        if (window[windowDetectionKeyValue]) {
            return true;
        }
    };
    for (const documentDetectionKey in documentDetectionKeys) {
        const documentDetectionKeyValue = documentDetectionKeys[documentDetectionKey];
        if (window['document'][documentDetectionKeyValue]) {
            return true;
        }
    };

    for (const documentKey in window['document']) {
        if (documentKey.match(/\$[a-z]dc_/) && window['document'][documentKey]['cache_']) {
            return true;
        }
    }

    if (window['external'] && window['external'].toString() && (window['external'].toString()['indexOf']('Sequentum') != -1)) return true;

    if (window['document']['documentElement']['getAttribute']('selenium')) return true;
    if (window['document']['documentElement']['getAttribute']('webdriver')) return true;
    if (window['document']['documentElement']['getAttribute']('driver')) return true;

    return false;
};

According to answer, there are multiple methods to remove them. One of them is simply opening chromedriver.exe with a HEX-editor and removing all occurences of $cdc_

Ответ 2

Replacing `cdc_` string

You can use Vim or Perl to replace the cdc_ string in chromedriver. See the answer by @Erti-Chris Eelmaa to learn more about that string and how it's a detection point.

Using Vim or Perl prevents you from having to recompile source code or use a hex editor.

Make sure to make a copy of the original chromedriver before attempting to edit it.

Our goal is to alter the cdc_ string, which looks something like $cdc_lasutopfhvcZLmcfl.

The methods below were tested on chromedriver version 2.41.578706.

Using Vim

vim -b /path/to/chromedriver

After running the line above, you'll probably see a bunch of gibberish. Do the following:

Replace all instances of cdc_ with dog_ by typing :%s/cdc_/dog_/g.
- dog_ is just an example. You can choose anything as long as it has the same amount of characters as the search string (e.g., cdc_), otherwise the chromedriver will fail.

To save the changes and quit, type :wq! and press return.
- If you need to quit without saving changes, type :q! and press return.

The -b option tells vim upfront to open the file as a binary, so it won't mess with things like (missing) line endings (especially at the end of the file).

Using Perl

The line below replaces all cdc_ occurrences with dog_. Credit to Vic Seedoubleyew:

perl -pi -e 's/cdc_/dog_/g' /path/to/chromedriver

Make sure that the replacement string (e.g., dog_) has the same number of characters as the search string (e.g., cdc_), otherwise the chromedriver will fail.

Wrapping Up

To verify that all occurrences of cdc_ were replaced:

grep "cdc_" /path/to/chromedriver

If no output was returned, the replacement was successful.

Go to the altered chromedriver and double click on it. A terminal window should open up. If you don't see killed in the output, you've successfully altered the driver.

Make sure that the name of the altered chromedriver binary is chromedriver, and that the original binary is either moved from its original location or renamed.

My Experience With This Method

I was previously being detected on a website while trying to log in, but after replacing cdc_ with an equal sized string, I was able to log in. Like others have said though, if you've already been detected, you might get blocked for a plethora of other reasons even after using this method. So you may have to try accessing the site that was detecting you using a VPN, different network, etc.

Ответ 3

As we've already figured out in the question and the posted answers, there is an anti Web-scraping and a bot detection service called "Distil Networks" (which is now "Imperva") in play here. And, according to the company CEO's interview:

Even though they can create new bots, we figured out a way to identify
Selenium the a tool they’re using, so we’re blocking Selenium no
matter how many times they iterate on that bot. We’re doing that now
with Python and a lot of different technologies. Once we see a pattern
emerge from one type of bot, then we work to reverse engineer the
technology they use and identify it as malicious.

It'll take time and additional challenges to understand how exactly they are detecting Selenium, but what can we say for sure at the moment:

it's not related to the actions you take with Selenium. Once you navigate to the site, you get immediately detected and banned. I've tried to add artificial random delays between actions, take a pause after the page is loaded - nothing helped

it's not about browser fingerprint either. I tried it in multiple browsers with clean profiles and not, incognito modes, but nothing helped

since, according to the hint in the interview, this was "reverse engineering", I suspect this is done with some JavaScript code being executed in the browser revealing that this is a browser automated via Selenium WebDriver

I decided to post it as an answer, since clearly:

Can a website detect when you are using selenium with chromedriver?

Yes.

Also, I haven't experimented with older Selenium and older browser versions. In theory, there could be something implemented/added to Selenium at a certain point that Distil Networks bot detector currently relies on. Then, if this is the case, we might detect (yeah, let's detect the detector) at what point/version a relevant change was made, look into changelog and changesets and, may be, this could give us more information on where to look and what is it they use to detect a webdriver-powered browser. It's just a theory that needs to be tested.

Ответ 4

A lot have been analyzed and discussed about a website being detected being driven by Selenium controlled ChromeDriver. Here are my two cents:

According to the article Browser detection using the user agent serving different webpages or services to different browsers is usually not among the best of ideas. The web is meant to be accessible to everyone, regardless of which browser or device an user is using. There are best practices outlined to develop a website to progressively enhance itself based on the feature availability rather than by targeting specific browsers.

However, browsers and standards are not perfect, and there are still some edge cases where some websites still detects the browser and if the browser is driven by Selenium controled WebDriver. Browsers can be detected through different ways and some commonly used mechanisms are as follows:

Implementing captcha / recaptcha to detect the automatic bots.

You can find a relevant detailed discussion in How does recaptcha 3 know I'm using selenium/chromedriver?

Detecting the term HeadlessChrome within headless Chrome UserAgent

You can find a relevant detailed discussion in Access Denied page with headless Chrome on Linux while headed Chrome works on windows using Selenium through Python

Using Bot Management service from Distil Networks

You can find a relevant detailed discussion in Unable to use Selenium to automate Chase site login

Using Bot Manager service from Akamai

You can find a relevant detailed discussion in Dynamic dropdown doesn't populate with auto suggestions on https://www.nseindia.com/ when values are passed using Selenium and Python

Using Bot Protection service from Datadome

You can find a relevant detailed discussion in Website using DataDome gets captcha blocked while scraping using Selenium and Python

However, using the user-agent to detect the browser looks simple but doing it well is in fact a bit tougher.

Note: At this point it's worth to mention that: it's very rarely a good idea to use user agent sniffing. There are always better and more broadly compatible way to address a certain issue.

Considerations for browser detection

The idea behind detecting the browser can be either of the following:

Trying to work around a specific bug in some specific variant or specific version of a webbrowser.

Trying to check for the existence of a specific feature that some browsers don't yet support.

Trying to provide different HTML depending on which browser is being used.

Alternative of browser detection through UserAgents

Some of the alternatives of browser detection are as follows:

Implementing a test to detect how the browser implements the API of a feature and determine how to use it from that. An example was Chrome unflagged experimental lookbehind support in regular expressions.

Adapting the design technique of Progressive enhancement which would involve developing a website in layers, using a bottom-up approach, starting with a simpler layer and improving the capabilities of the site in successive layers, each using more features.

Adapting the top-down approach of Graceful degradation in which we build the best possible site using all the features we want and then tweak it to make it work on older browsers.

Solution

To prevent the Selenium driven WebDriver from getting detected, a niche approach would include either/all of the below mentioned approaches:

Rotating the UserAgent in every execution of your Test Suite using fake_useragent module as follows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent

options = Options()
ua = UserAgent()
userAgent = ua.random
print(userAgent)
options.add_argument(f'user-agent={userAgent}')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\WebDrivers\ChromeDriver\chromedriver_win32\chromedriver.exe')
driver.get("https://www.google.co.in")
driver.quit()

You can find a relevant detailed discussion in Way to change Google Chrome user agent in Selenium?

Rotating the UserAgent in each of your Tests using Network.setUserAgentOverride through execute_cdp_cmd() as follows:

from selenium import webdriver

driver = webdriver.Chrome(executable_path=r'C:\WebDrivers\chromedriver.exe')
print(driver.execute_script("return navigator.userAgent;"))
# Setting user agent as Chrome/83.0.4103.97
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'})
print(driver.execute_script("return navigator.userAgent;"))

You can find a relevant detailed discussion in How to change the User Agent using Selenium and Python

Changing the property value of navigator for webdriver to undefined as follows:

driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
  "source": """
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined
    })
  """
})

You can find a relevant detailed discussion in Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection

Changing the values of navigator.plugins, navigator.languages, WebGL, hairline feature, missing image, etc.

You can find a relevant detailed discussion in Is there a version of selenium webdriver that is not detectable?

Changing the conventional Viewport

You can find a relevant detailed discussion in How to bypass Google captcha with Selenium and python?

Dealing with reCAPTCHA

While dealing with 2captcha and recaptcha-v3 rather clicking on checkbox associated to the text I'm not a robot, it may be easier to get authenticated extracting and using the data-sitekey.

You can find a relevant detailed discussion in How to identify the 32 bit data-sitekey of ReCaptcha V2 to obtain a valid response programmatically using Selenium and Python Requests?

tl; dr

You can find a cutting edge solution to evade webdriver detection in:

selenium-stealth - a proven way to evade webdriver detection