Может ли веб-сайт определить, когда вы используете Selenium с chromedriver?
Я тестировал Selenium с Chromedriver и заметил, что некоторые страницы могут определить, что вы используете Selenium, даже если там вообще нет автоматизации. Даже когда я просто просматриваю вручную, просто используя Chrome через Selenium и Xephyr, я часто получаю страницу с сообщением об обнаружении подозрительной активности. Я проверил свой пользовательский агент и отпечаток пальца в браузере, и все они в точности идентичны обычному браузеру Chrome.
Когда я захожу на эти сайты в обычном Chrome, все работает нормально, но в тот момент, когда я использую Selenium, я обнаружен.
Теоретически chromedriver и Chrome должны выглядеть буквально одинаково на любом веб-сервере, но каким-то образом они могут это обнаружить.
Если вам нужен тестовый код, попробуйте это:
from pyvirtualdisplay import Display
from selenium import webdriver
display = Display(visible=1, size=(1600, 902))
display.start()
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--disable-extensions')
chrome_options.add_argument('--profile-directory=Default')
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--disable-plugins-discovery");
chrome_options.add_argument("--start-maximized")
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.delete_all_cookies()
driver.set_window_size(800,800)
driver.set_window_position(0,0)
print 'arguments done'
driver.get('http://stubhub.com')
Если вы просматриваете stubhub, вы будете перенаправлены и "заблокированы" в течение одного или двух запросов. Я расследовал это и не могу понять, как они могут определить, что пользователь использует Selenium.
Как они это делают?
Я установил плагин Selenium IDE в Firefox, и меня забанили, когда я зашел на stubhub.com в обычном браузере Firefox только с дополнительным плагином.
Когда я использую Fiddler для просмотра HTTP-запросов, отправляемых туда и обратно, я заметил, что запросы "поддельного браузера" часто имеют "no-cache" в заголовке ответа.
Такие результаты Есть ли способ определить, что я нахожусь на странице Selenium Webdriver с помощью JavaScript? предполагаю, что не должно быть способа определить, когда вы используете webdriver. Но эти данные свидетельствуют об обратном.
Сайт загружает отпечаток пальца на свои серверы, но я проверил, и отпечаток пальца Selenium идентичен отпечатку пальца при использовании Chrome.
Это одна из полезных загрузок отпечатков пальцев, которые они отправляют на свои серверы:
{"appName":"Netscape","platform":"Linuxx86_64","cookies":1,"syslang":"en-US","userlang":"en-
US","cpu":"","productSub":"20030107","setTimeout":1,"setInterval":1,"plugins":
{"0":"ChromePDFViewer","1":"ShockwaveFlash","2":"WidevineContentDecryptionMo
dule","3":"NativeClient","4":"ChromePDFViewer"},"mimeTypes":
{"0":"application/pdf","1":"ShockwaveFlashapplication/x-shockwave-
flash","2":"FutureSplashPlayerapplication/futuresplash","3":"WidevineContent
DecryptionModuleapplication/x-ppapi-widevine-
cdm","4":"NativeClientExecutableapplication/x-
nacl","5":"PortableNativeClientExecutableapplication/x-
pnacl","6":"PortableDocumentFormatapplication/x-google-chrome-
pdf"},"screen":{"width":1600,"height":900,"colorDepth":24},"fonts":
{"0":"monospace","1":"DejaVuSerif","2":"Georgia","3":"DejaVuSans","4":"Trebu
chetMS","5":"Verdana","6":"AndaleMono","7":"DejaVuSansMono","8":"LiberationM
ono","9":"NimbusMonoL","10":"CourierNew","11":"Courier"}}
Он идентичен в Selenium и в Chrome.
VPN работают для одноразового использования, но они обнаруживаются после загрузки первой страницы. Очевидно, что для обнаружения Selenium выполняется какой-то код JavaScript.
Переведено автоматически
Ответ 1
По сути, способ, которым работает обнаружение Selenium, заключается в том, что они тестируют предопределенные переменные JavaScript, которые появляются при запуске с Selenium. Скрипты обнаружения ботов обычно просматривают все, что содержит слово "selenium" / "webdriver" в любой из переменных (в объекте window), а также переменные документа с именами $cdc_
и $wdc_
. Конечно, все это зависит от того, в каком браузере вы используете. Все разные браузеры отображают разные вещи.
For me, I used Chrome, so, all that I had to do was to ensure that $cdc_
didn't exist anymore as a document variable, and voilà (download chromedriver source code, modify chromedriver and re-compile $cdc_
under different name.)
This is the function I modified in chromedriver:
File call_function.js:
function getPageCache(opt_doc) {
var doc = opt_doc || document;
//var key = '$cdc_asdjflasutopfhvcZLmcfl_';
var key = 'randomblabla_';
if (!(key in doc))
doc[key] = new Cache();
return doc[key];
}
(Note the comment. All I did I turned $cdc_
to randomblabla_
.)
Here is pseudocode which demonstrates some of the techniques that bot networks might use:
runBotDetection = function () {
var documentDetectionKeys = [
"__webdriver_evaluate",
"__selenium_evaluate",
"__webdriver_script_function",
"__webdriver_script_func",
"__webdriver_script_fn",
"__fxdriver_evaluate",
"__driver_unwrapped",
"__webdriver_unwrapped",
"__driver_evaluate",
"__selenium_unwrapped",
"__fxdriver_unwrapped",
];
var windowDetectionKeys = [
"_phantom",
"__nightmare",
"_selenium",
"callPhantom",
"callSelenium",
"_Selenium_IDE_Recorder",
];
for (const windowDetectionKey in windowDetectionKeys) {
const windowDetectionKeyValue = windowDetectionKeys[windowDetectionKey];
if (window[windowDetectionKeyValue]) {
return true;
}
};
for (const documentDetectionKey in documentDetectionKeys) {
const documentDetectionKeyValue = documentDetectionKeys[documentDetectionKey];
if (window['document'][documentDetectionKeyValue]) {
return true;
}
};
for (const documentKey in window['document']) {
if (documentKey.match(/\$[a-z]dc_/) && window['document'][documentKey]['cache_']) {
return true;
}
}
if (window['external'] && window['external'].toString() && (window['external'].toString()['indexOf']('Sequentum') != -1)) return true;
if (window['document']['documentElement']['getAttribute']('selenium')) return true;
if (window['document']['documentElement']['getAttribute']('webdriver')) return true;
if (window['document']['documentElement']['getAttribute']('driver')) return true;
return false;
};
According to answer, there are multiple methods to remove them. One of them is simply opening chromedriver.exe
with a HEX-editor and removing all occurences of $cdc_
Ответ 2
Replacing cdc_
string
You can use Vim or Perl to replace the cdc_
string in chromedriver
. See the answer by @Erti-Chris Eelmaa to learn more about that string and how it's a detection point.
Using Vim or Perl prevents you from having to recompile source code or use a hex editor.
Make sure to make a copy of the original chromedriver
before attempting to edit it.
Our goal is to alter the cdc_
string, which looks something like $cdc_lasutopfhvcZLmcfl
.
The methods below were tested on chromedriver version 2.41.578706
.
Using Vim
vim -b /path/to/chromedriver
After running the line above, you'll probably see a bunch of gibberish. Do the following:
- Replace all instances of
cdc_
withdog_
by typing:%s/cdc_/dog_/g
.dog_
is just an example. You can choose anything as long as it has the same amount of characters as the search string (e.g.,cdc_
), otherwise thechromedriver
will fail.
- To save the changes and quit, type
:wq!
and pressreturn
.- If you need to quit without saving changes, type
:q!
and pressreturn
.
- If you need to quit without saving changes, type
The -b
option tells vim upfront to open the file as a binary, so it won't mess with things like (missing) line endings (especially at the end of the file).
Using Perl
The line below replaces all cdc_
occurrences with dog_
. Credit to Vic Seedoubleyew:
perl -pi -e 's/cdc_/dog_/g' /path/to/chromedriver
Make sure that the replacement string (e.g., dog_
) has the same number of characters as the search string (e.g., cdc_
), otherwise the chromedriver
will fail.
Wrapping Up
To verify that all occurrences of cdc_
were replaced:
grep "cdc_" /path/to/chromedriver
If no output was returned, the replacement was successful.
Go to the altered chromedriver
and double click on it. A terminal window should open up. If you don't see killed
in the output, you've successfully altered the driver.
Make sure that the name of the altered chromedriver
binary is chromedriver
, and that the original binary is either moved from its original location or renamed.
My Experience With This Method
I was previously being detected on a website while trying to log in, but after replacing cdc_
with an equal sized string, I was able to log in. Like others have said though, if you've already been detected, you might get blocked for a plethora of other reasons even after using this method. So you may have to try accessing the site that was detecting you using a VPN, different network, etc.
Ответ 3
As we've already figured out in the question and the posted answers, there is an anti Web-scraping and a bot detection service called "Distil Networks" (which is now "Imperva") in play here. And, according to the company CEO's interview:
Even though they can create new bots, we figured out a way to identify
Selenium the a tool they’re using, so we’re blocking Selenium no
matter how many times they iterate on that bot. We’re doing that now
with Python and a lot of different technologies. Once we see a pattern
emerge from one type of bot, then we work to reverse engineer the
technology they use and identify it as malicious.
It'll take time and additional challenges to understand how exactly they are detecting Selenium, but what can we say for sure at the moment:
- it's not related to the actions you take with Selenium. Once you navigate to the site, you get immediately detected and banned. I've tried to add artificial random delays between actions, take a pause after the page is loaded - nothing helped
- it's not about browser fingerprint either. I tried it in multiple browsers with clean profiles and not, incognito modes, but nothing helped
- since, according to the hint in the interview, this was "reverse engineering", I suspect this is done with some JavaScript code being executed in the browser revealing that this is a browser automated via Selenium WebDriver
I decided to post it as an answer, since clearly:
Can a website detect when you are using selenium with chromedriver?
Yes.
Also, I haven't experimented with older Selenium and older browser versions. In theory, there could be something implemented/added to Selenium at a certain point that Distil Networks bot detector currently relies on. Then, if this is the case, we might detect (yeah, let's detect the detector) at what point/version a relevant change was made, look into changelog and changesets and, may be, this could give us more information on where to look and what is it they use to detect a webdriver-powered browser. It's just a theory that needs to be tested.
Ответ 4
A lot have been analyzed and discussed about a website being detected being driven by Selenium controlled ChromeDriver. Here are my two cents:
According to the article Browser detection using the user agent serving different webpages or services to different browsers is usually not among the best of ideas. The web is meant to be accessible to everyone, regardless of which browser or device an user is using. There are best practices outlined to develop a website to progressively enhance itself based on the feature availability rather than by targeting specific browsers.
However, browsers and standards are not perfect, and there are still some edge cases where some websites still detects the browser and if the browser is driven by Selenium controled WebDriver. Browsers can be detected through different ways and some commonly used mechanisms are as follows:
You can find a relevant detailed discussion in How does recaptcha 3 know I'm using selenium/chromedriver?
- Detecting the term HeadlessChrome within headless Chrome UserAgent
You can find a relevant detailed discussion in Access Denied page with headless Chrome on Linux while headed Chrome works on windows using Selenium through Python
- Using Bot Management service from Distil Networks
You can find a relevant detailed discussion in Unable to use Selenium to automate Chase site login
- Using Bot Manager service from Akamai
You can find a relevant detailed discussion in Dynamic dropdown doesn't populate with auto suggestions on https://www.nseindia.com/ when values are passed using Selenium and Python
- Using Bot Protection service from Datadome
You can find a relevant detailed discussion in Website using DataDome gets captcha blocked while scraping using Selenium and Python
However, using the user-agent to detect the browser looks simple but doing it well is in fact a bit tougher.
Note: At this point it's worth to mention that: it's very rarely a good idea to use user agent sniffing. There are always better and more broadly compatible way to address a certain issue.
Considerations for browser detection
The idea behind detecting the browser can be either of the following:
- Trying to work around a specific bug in some specific variant or specific version of a webbrowser.
- Trying to check for the existence of a specific feature that some browsers don't yet support.
- Trying to provide different HTML depending on which browser is being used.
Alternative of browser detection through UserAgents
Some of the alternatives of browser detection are as follows:
- Implementing a test to detect how the browser implements the API of a feature and determine how to use it from that. An example was Chrome unflagged experimental lookbehind support in regular expressions.
- Adapting the design technique of Progressive enhancement which would involve developing a website in layers, using a bottom-up approach, starting with a simpler layer and improving the capabilities of the site in successive layers, each using more features.
- Adapting the top-down approach of Graceful degradation in which we build the best possible site using all the features we want and then tweak it to make it work on older browsers.
Solution
To prevent the Selenium driven WebDriver from getting detected, a niche approach would include either/all of the below mentioned approaches:
Rotating the UserAgent in every execution of your Test Suite using
fake_useragent
module as follows:from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent
options = Options()
ua = UserAgent()
userAgent = ua.random
print(userAgent)
options.add_argument(f'user-agent={userAgent}')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\WebDrivers\ChromeDriver\chromedriver_win32\chromedriver.exe')
driver.get("https://www.google.co.in")
driver.quit()
You can find a relevant detailed discussion in Way to change Google Chrome user agent in Selenium?
Rotating the UserAgent in each of your Tests using
Network.setUserAgentOverride
throughexecute_cdp_cmd()
as follows:from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:\WebDrivers\chromedriver.exe')
print(driver.execute_script("return navigator.userAgent;"))
# Setting user agent as Chrome/83.0.4103.97
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'})
print(driver.execute_script("return navigator.userAgent;"))
You can find a relevant detailed discussion in How to change the User Agent using Selenium and Python
Changing the property value of
navigator
for webdriver toundefined
as follows:driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
"source": """
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
})
"""
})
You can find a relevant detailed discussion in Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
- Changing the values of
navigator.plugins
,navigator.languages
, WebGL, hairline feature, missing image, etc.
You can find a relevant detailed discussion in Is there a version of selenium webdriver that is not detectable?
- Changing the conventional Viewport
You can find a relevant detailed discussion in How to bypass Google captcha with Selenium and python?
Dealing with reCAPTCHA
While dealing with 2captcha and recaptcha-v3 rather clicking on checkbox associated to the text I'm not a robot, it may be easier to get authenticated extracting and using the data-sitekey
.
You can find a relevant detailed discussion in How to identify the 32 bit data-sitekey of ReCaptcha V2 to obtain a valid response programmatically using Selenium and Python Requests?
tl; dr
You can find a cutting edge solution to evade webdriver detection in: