I’m trying to set up a scenario that detects unwanted crawlers on a B2C shop site. Characteristics are:
Crawlers visit multiple country shops, identified by locale/currency in the request URI
Crawlers usually do not request static content like CSS and JS resources
Crawlers identify with a multitude of different user agent strings
I have built a scenario of type “conditional” that evaluates the corresponding event properties and fires when the number of country shops visited is above a threshold, multiple user agents are seen, and no static content is requested.
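For context, a stripped-down skeleton of what I have (untested as pasted here; the filter and the condition body are placeholders, the real condition iterates over queue.Queue and counts distinct meta values):

```yaml
type: conditional
name: me/shop-crawler-conditional
description: "many country shops, many user agents, no static content"
filter: "evt.Meta.log_type == 'http_access-log'"
groupby: evt.Meta.source_ip
leakspeed: 30s
# pseudocode: the real expression walks queue.Queue and counts
# distinct evt.Meta.locale_currency / evt.Meta.http_user_agent values
condition: |
  number_of_distinct_locales > 10
  and number_of_distinct_user_agents > 2
  and no_static_content_requested
labels:
  service: http
  type: crawl
```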
So far, so good… but since this is a B2C site, we have thousands of client IPs active at the same time, and in order to evaluate the above conditions I need quite a lot of events in my buckets. That makes the scenario quite resource-hungry.
Now, there is a property cache_size that allows limiting the number of events kept in memory for a bucket. The question I’m facing is: when I access queue.Queue in my condition and evaluate properties of the queue elements, will the queue then only contain cache_size elements? If so, my condition will no longer work.
So… does anyone know how cache_size plays together with queue.Queue in conditional buckets?
Your queue will be limited by cache_size, so whatever value you set will be the maximum number of events in the queue.
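Put differently, with a setting like this (made-up value) any check you run over the queue only ever sees the newest 50 events:

```yaml
# queue.Queue holds at most the newest cache_size events, so every
# expression over it (len, any, none, ...) only sees these 50 entries
cache_size: 50
```

So a condition such as len(queue.Queue) > 50 can never become true; your thresholds have to fit inside the cache.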
However, what about a leaky bucket? If we could see which data structure you are working with, we could use distinct on the locale/currency plus user agent, and then use the cancel_on property to cancel the overflow if a static resource is requested. That said, the overflow may not always be cancelled, depending on the user’s requests.
Note: we can use distinct together with cache_size to limit memory, since the distinct values are kept even when other events are flushed by cache_size. That means we can cap memory and still keep track of user agent plus locale.
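Roughly something like this (untested, and the FIXME fields are placeholders since I don’t know your event structure; capacity/leakspeed values are made up):

```yaml
type: leaky
name: me/shop-crawler-distinct
description: "client cycling through many locale/user-agent combinations"
filter: "evt.Meta.log_type == 'http_access-log'"   # FIXME: your http filter
groupby: evt.Meta.source_ip
# only events carrying a not-yet-seen value of this expression are poured
distinct: "evt.Meta.FIXME_locale + '|' + evt.Meta.FIXME_user_agent"
# cancel the pending overflow if the client does request static content
cancel_on: "evt.Meta.FIXME_is_static == 'true'"
capacity: 10
leakspeed: 30s
cache_size: 5    # keep memory down; distinct values survive the flush
blackhole: 1m
labels:
  service: http
  type: crawl
```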
The relevant data is basically all kept in the metadata. We have meta.locale_currency, which holds something like /de_DE/EUR, and meta.http_user_agent, which holds the user agent string. I think we could use the concatenation of locale_currency + http_user_agent as the distinct attribute. However, then it’s not possible to express a condition like “more than 10 distinct locale_currency values AND more than 2 user agents” - it’s more like defining a threshold of n, and you can reach that with one locale_currency x n user agents, or with n locale_currency x one user agent, or anything in between. Not quite what I want to express, but maybe good enough and definitely resource-saving.
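As a distinct expression that would be something like (using the meta fields above; the separator is arbitrary):

```yaml
distinct: "evt.Meta.locale_currency + '|' + evt.Meta.http_user_agent"
```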
Another thought was: is it possible to have a special (anti-)enrichment parser that drops the unmarshaled map as well as the original log line from the event? All relevant data is copied to meta anyway, so this should reduce the event size considerably. Do you know if that is possible?