How to reduce memory usage on backtest/hyperopt

Breaking news, in case you don’t know, hyperopt is just backtest(s) done in parallel. Okay, now to the real stuff.

When you run a backtest or hyperopt, the candle data and the dataframes are both stored in memory (RAM) rather than on your hard disk. That’s why your strategy needs to use memory efficiently, especially if you want to backtest/hyperopt plenty of coins over a wide timerange with a strategy that has plenty of indicators. In my experience, people (unknowingly) waste memory on redundant things, such as

Storing unused data in the dataframe

Take this snippet for example

bb_40 = qtpylib.bollinger_bands(dataframe['close'], window=40, stds=2)
dataframe['lower'] = bb_40['lower']
dataframe['mid'] = bb_40['mid']
dataframe['bbdelta'] = (dataframe['mid'] - dataframe['lower']).abs()

The code above stores 3 new columns in the dataframe for every coin. That’s fine if you actually use all 3 of them in your strategy. But this particular strategy doesn’t use the mid value at all, apart from calculating bbdelta. So you are filling up your memory with 1 unused column. Instead of storing mid in the dataframe, you can write the code above like this

bb_40 = qtpylib.bollinger_bands(dataframe['close'], window=40, stds=2)
dataframe['lower'] = bb_40['lower']
dataframe['bbdelta'] = (bb_40['mid'] - dataframe['lower']).abs()

Now you are storing only 2 columns instead of 3. The freed memory can be used to backtest slightly more coins and/or a wider timerange, for example. Another example of bad practice is this snippet

dataframe['sma_5'] = ta.SMA(dataframe, timeperiod=5)
dataframe['lips'] = ta.SMA(dataframe, timeperiod=5)

The strategy writes a duplicate column that has the same values as another column. As in the previous example, you are wasting valuable memory. So when you write your strategy, be mindful of what you store in the dataframe. It might save you a significant amount of memory.
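To get a feel for what one extra column costs, here is a rough back-of-envelope sketch. It assumes every indicator value is a float64 (8 bytes) and ignores pandas overhead, so real usage will be somewhat higher. The function name and the example numbers (100 pairs, 365 days of 5m candles) are made up for illustration.

```python
BYTES_PER_FLOAT64 = 8  # assumption: each indicator value is stored as float64


def column_cost_bytes(pairs: int, days: int, candles_per_day: int, columns: int = 1) -> int:
    """Approximate bytes used by `columns` float64 columns across all pairs."""
    rows_per_pair = days * candles_per_day
    return pairs * rows_per_pair * columns * BYTES_PER_FLOAT64


# Hypothetical setup: 100 pairs, 365 days of 5m candles (288 candles per day).
one_column = column_cost_bytes(pairs=100, days=365, candles_per_day=288)
print(f"{one_column / 1024**2:.1f} MiB per extra column")  # 80.2 MiB per extra column
```

So under these assumptions, every unused or duplicated column costs you roughly 80 MiB, and that is before hyperopt multiplies it by the number of parallel jobs.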

Putting loops inside populate_indicators (for hyperopt)

Let’s say, for example, you want to hyperopt the length of the EMA used for entry, with a value between 5 and 15. There are 2 ways of doing it.

Generating the columns in populate_indicators

People either do this manually or write a loop to generate columns ema_5 to ema_15 in the dataframe. The issue with this method is that each epoch uses only one of those 11 columns, so you are basically generating 10 unused columns (assuming you use them for entry logic only) that occupy your memory. A better way to do it (in my opinion) is…

Generating the column in populate_entry_trend

Create a hyperopt param for the length, for example

buy_length_ema = IntParameter(5, 15, default=5, optimize=True)

and then in populate_entry_trend

def populate_entry_trend(self, dataframe: DataFrame, metadata: dict) -> DataFrame:
    dataframe['ema_buy'] = ta.EMA(dataframe, timeperiod=int(self.buy_length_ema.value))
    # ... entry conditions using dataframe['ema_buy'] go here ...
    return dataframe

Instead of creating 11 columns, you now create just 1 column while still being able to hyperopt the length parameter for ema_buy. You will save a lot of memory.

If you have done both of the above but still want to reduce memory usage further, and you are okay with hyperopt taking longer to finish, you can use these two methods

Limit the job count to 1 (for hyperopt)

As said in the beginning, hyperopt is just backtests done in parallel. Sadly, the backtests don’t share the dataframe with each other. That means the more CPU cores you use, the more memory you use. To use the least memory possible, limit the number of CPU cores used to 1 (for example with the -j 1 option of freqtrade hyperopt).
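The scaling can be sketched with simple arithmetic. This assumes every worker keeps its own full copy of the data and ignores the main process and any other overhead, so treat it as a rough guide only. The function name and the 1500 MiB figure are hypothetical.

```python
def hyperopt_memory_mib(per_worker_mib: float, jobs: int) -> float:
    """Rough total memory if every worker holds its own copy of the data."""
    return per_worker_mib * jobs


# Hypothetical: the data for one backtest takes about 1500 MiB.
for jobs in (1, 4, 8):
    print(jobs, "job(s):", hyperopt_memory_mib(1500, jobs), "MiB")
```

Under that assumption, going from 8 jobs to 1 job cuts memory usage to about one eighth, at the cost of a much longer run.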

Calculate only-used-once indicators inside their respective functions

For example, say you only use ema_15 for your entry logic. Instead of calculating the indicator inside populate_indicators and storing the values in the dataframe (which means they stay in memory), you can calculate it inside populate_entry_trend like this

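def populate_entry_trend(self, dataframe: DataFrame, metadata: dict) -> DataFrame:
    ema_15 = ta.EMA(dataframe, timeperiod=15)  # local variable, never stored as a column
    # ... entry conditions using ema_15 go here ...
    return dataframe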

You will save memory this way, but hyperopt will take longer to finish, because ema_15 is recalculated on every epoch instead of just once, as it would be inside populate_indicators. For a backtest, the longer finish time is probably insignificant, since the indicator is calculated only once either way.
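The trade-off is easy to see if you just count how often the indicator gets calculated. The sketch below is a plain-Python mock of the call pattern, not freqtrade code: the class, the function, and the 500 epochs are hypothetical, and the counts are per pair.

```python
class CallCounter:
    """Counts how many times the (mock) indicator is computed."""

    def __init__(self) -> None:
        self.indicator_runs = 0

    def compute_ema(self) -> None:
        # Stand-in for ta.EMA(dataframe, timeperiod=15)
        self.indicator_runs += 1


def run_hyperopt(epochs: int, ema_in_entry_trend: bool) -> int:
    """Return how many times the EMA is computed over `epochs` epochs."""
    counter = CallCounter()
    if not ema_in_entry_trend:
        counter.compute_ema()  # populate_indicators: runs once, result stays in memory
    for _ in range(epochs):
        if ema_in_entry_trend:
            counter.compute_ema()  # populate_entry_trend: runs on every epoch
    return counter.indicator_runs


print(run_hyperopt(epochs=500, ema_in_entry_trend=False))  # 1
print(run_hyperopt(epochs=500, ema_in_entry_trend=True))   # 500
```

With a single run (epochs=1), which is what a backtest amounts to, both variants compute the indicator exactly once.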

