Limit cache size on conditional scenarios

Hello,

I’m trying to set up a scenario that detects unwanted crawlers on a B2C shop site. Characteristics are:

  • Crawlers visit multiple country shops, identified by locale/currency in the request URI
  • Crawlers usually do not request static content like CSS and JS resources
  • Crawlers identify with a multitude of different user agent strings

I have built a scenario of type “conditional” that evaluates the corresponding event properties and fires if the number of country sites visited is above a threshold and we have multiple user agents and no static content.

So far, so good… but since this is a B2C site, we have thousands of client IPs active at the same time, and to evaluate the above conditions I need quite a lot of events in each bucket. This makes the scenario quite resource-intensive.

Now, there is a property cache_size that limits the number of events kept in memory for a bucket. The question I’m facing is: when I access queue.Queue in my condition and evaluate some properties of the queue elements, will the queue then contain only cache_size elements? In that case, my condition will no longer work.

So… does anyone know how cache_size plays together with queue.Queue in conditional buckets?

Thanks for any insights!

Cheers
Albrecht

Your queue will be limited by cache_size, so whatever value you set is the maximum number of events the queue will hold.
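
To make the effect on a condition concrete, here is a rough Python model of a size-limited queue. It is only a sketch and not CrowdSec code: it assumes the oldest events are dropped first once the limit is reached, and the function name and event fields are made up for illustration.

```python
from collections import deque


def visible_events(events, cache_size):
    """Return the events a condition could still see if the queue
    keeps only the newest `cache_size` entries (assumed behaviour)."""
    queue = deque(maxlen=cache_size)  # older entries fall out automatically
    for evt in events:
        queue.append(evt)
    return list(queue)


# Hypothetical traffic: one client visiting 20 different country shops.
events = [{"locale_currency": f"/loc_{i}/EUR"} for i in range(20)]

kept = visible_events(events, cache_size=5)
distinct_locales = {e["locale_currency"] for e in kept}

# Only the locales from the last 5 events remain countable.
print(len(distinct_locales))
```

In this model a condition such as "more than 10 distinct locales" can never become true with a limit of 5, which is exactly the concern raised in the question.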

However, what about a leaky bucket? If we could see what data structure you are working with, we could use distinct to capture the locale/currency plus user agent, and then use the cancel_on property to cancel the overflow if the client requested a static resource. That said, the bucket may not always be cancelled, depending on what the user requests.

Note: we can use distinct together with cache_size to limit memory, as the distinct values are kept when other events are flushed by cache_size. That means we can limit memory and still keep track of user agent plus locale.

Hi Laurence,
thanks for your reply!

The relevant data is basically all kept in the metadata. We have meta.locale_currency which holds something like /de_DE/EUR and meta.http_user_agent which holds the user agent string. I think we could use the concatenation of locale_currency + http_user_agent as distinct attribute. However, then it’s not possible to express a condition like “more than 10 distinct locale_currency values AND more than 2 user agents” - it’s more like defining a threshold of n, and you can reach that with one locale_currency x n user agents or with n locale_currency x one user agent or everything in between. Not quite what I want to express, but maybe good enough and definitely resource saving.
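
To show the difference between the two ways of counting, here is a small Python sketch. It is not CrowdSec code; the function names and sample events are hypothetical, and only the field names locale_currency and http_user_agent come from our metadata.

```python
def combined_key_fires(events, threshold):
    """One threshold on the concatenated locale_currency + user agent key."""
    keys = {e["locale_currency"] + e["http_user_agent"] for e in events}
    return len(keys) > threshold


def separate_fires(events, min_locales, min_agents):
    """Two independent thresholds, as in the original conditional scenario."""
    locales = {e["locale_currency"] for e in events}
    agents = {e["http_user_agent"] for e in events}
    return len(locales) > min_locales and len(agents) > min_agents


# Hypothetical client: a single country shop, but 12 different user agents.
one_locale_many_agents = [
    {"locale_currency": "/de_DE/EUR", "http_user_agent": f"ua-{i}"}
    for i in range(12)
]

print(combined_key_fires(one_locale_many_agents, threshold=10))
print(separate_fires(one_locale_many_agents, min_locales=10, min_agents=2))
```

With these sample events the combined key exceeds its threshold although only one locale was visited, while the two separate thresholds do not fire. That is the loss of precision I mean.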

Another thought was: is it possible to have a special (anti-)enrichment parser that removes the unmarshaled map as well as the original log line from the event? All relevant data is copied to meta anyway, so this should reduce the event size considerably. Do you know if this is possible?

Regards
Albrecht