- Updated Wednesday, July 20th 2016 @ 12:01:16
I have the new version play around 1000 games against the best version (usually the one currently selected here), alternating who goes first. That usually gives a good confidence interval (±2.5%).
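For reference, a minimal sketch of where that interval comes from, assuming a simple binomial model (the z-values are the standard 95%/90% factors; everything here is my own back-of-the-envelope, not taken from the post above):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Half-width of a confidence interval for a win rate measured
	// over n games. Worst case is p = 0.5, where variance is largest.
	n := 1000.0
	p := 0.5
	half95 := 1.960 * math.Sqrt(p*(1-p)/n)
	half90 := 1.645 * math.Sqrt(p*(1-p)/n)
	fmt.Printf("95%%: ±%.1f%%  90%%: ±%.1f%%\n", 100*half95, 100*half90)
	// 1000 games => 95%: ±3.1%, 90%: ±2.6%, in line with the ±2.5% quoted above.
}
```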
- Created Wednesday, July 20th 2016 @ 16:20:55
I like to use the last several versions, especially if I've made several changes. I started doing this a couple of competitions back because I noticed sometimes my new change would crush my most recent version but be worse against a version or two back.
I mostly noticed it when optimizing with something like a genetic algorithm, where I didn't directly control the changes or necessarily understand their "meaning". The optimization would focus on a weak spot, and I would end up with something overly specialized that didn't perform well against a wide variety of players.
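A sketch of what that kind of pool testing could look like in Go (all names here are hypothetical; play would wrap a real game between two bots):

```go
package arena

import "fmt"

// play runs one game between a (moving first) and b, returning
// 1 for an a win, 0.5 for a draw, 0 for a loss. Stubbed out here;
// in practice it would drive the actual game loop.
func play(a, b string) float64 { return 0.5 }

// Evaluate pits a candidate against each past version, alternating
// who goes first, and reports the worst win rate across the pool.
// Tracking the minimum (not just the average) is what catches a
// change that crushes the latest version but loses to older ones.
func Evaluate(candidate string, pool []string, games int) float64 {
	worst := 1.0
	for _, old := range pool {
		wins := 0.0
		for g := 0; g < games; g++ {
			if g%2 == 0 {
				wins += play(candidate, old)
			} else {
				wins += 1 - play(old, candidate)
			}
		}
		rate := wins / float64(games)
		fmt.Printf("%s vs %s: %.1f%%\n", candidate, old, 100*rate)
		if rate < worst {
			worst = rate
		}
	}
	return worst
}
```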
- Created Wednesday, July 20th 2016 @ 18:29:22
I do something similar to ghooo (2k games), though I artificially limit how much work is done per turn. This is partly so I don't have to care too much about the performance of an experiment. It is also useful for pure optimization changes, as a regression test (expecting the two versions to come out even).
I also have both micro and macro benchmarks. The macro ones I track and chart over time; they help me tune some time-management parameters.
- Created Thursday, July 21st 2016 @ 17:24:02
I agree with melsonator: the tests are more valid with various opponents, but older versions are often too weak to be meaningful opponents. The number of games depends on the expected improvement; for tuning parameters, I currently use 10000 games. I spawn several tournaments at the same time on my 8-core computer and use a fixed node count per move so CPU load can't skew the results (like ChickenCoop, if I understood correctly). Another important aspect: how do you randomize games, i.e. ensure the games don't repeat themselves? I personally use 5-move openings (with symmetries eliminated).
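One way to do the symmetry elimination, as a sketch: assuming Ultimate Tic-Tac-Toe moves are encoded as cell indices 0..80 on the 9x9 board (all names here are mine), the full board has the eight symmetries of a square, and applying one to every move of an opening yields an equivalent opening.

```go
package openings

// transform applies one of the 8 dihedral symmetries of the 9x9
// board to a cell index (row*9 + col).
func transform(cell, sym int) int {
	r, c := cell/9, cell%9
	if sym >= 4 { // reflect (transpose) first for the 4 reflections
		r, c = c, r
	}
	for i := 0; i < sym%4; i++ { // rotate 90° clockwise
		r, c = c, 8-r
	}
	return r*9 + c
}

// canonical returns the lexicographically smallest image of an
// opening under all 8 symmetries, used as a dedup key.
func canonical(opening []int) string {
	best := ""
	for sym := 0; sym < 8; sym++ {
		img := make([]byte, len(opening))
		for i, m := range opening {
			img[i] = byte(transform(m, sym))
		}
		if best == "" || string(img) < best {
			best = string(img)
		}
	}
	return best
}

// Dedup keeps one representative opening per symmetry class.
func Dedup(openings [][]int) [][]int {
	seen := map[string]bool{}
	var out [][]int
	for _, o := range openings {
		if k := canonical(o); !seen[k] {
			seen[k] = true
			out = append(out, o)
		}
	}
	return out
}
```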
- Created Friday, July 22nd 2016 @ 00:50:50
@melsonator, I think testing vs your older versions would produce more meaningful results, especially when optimizing parameters. I also used to test against some strong bots available online; I found this Android app to be the strongest: https://play.google.com/store/apps/details?id=com.magmamobile.game.UltimateTicTacToe&hl=en
@ChickenCoop, I don't get how you artificially limit the work being done; if you could give some examples, that would be awesome. What I do is give both bots 14000ms for the whole game, 700ms per move (because my machine is slower than theaigames server, so I need to give the bots more time). Also, could you elaborate on how you use the micro/macro benchmarks?
@NotABug, to get the negamax to play random moves every game, I shuffle the moves before testing them in the requestMove function (requestMove is the function that calls the negamax and returns the best move). This way, if many moves evaluate to essentially the same value, I select a random one of them. To make things more interesting, I shrink the range of the evaluation function: for example, if it is -10000 to 10000, I rescale it to -100 to 100 so more moves evaluate to the same value (think of it as buckets: shrinking the range increases the bucket size).
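Roughly like this, as a sketch (hypothetical names; negamax stands in for the real search, and the move list is assumed non-empty):

```go
package search

import "math/rand"

// bucket coarsens a raw evaluation (roughly -10000..10000) down to
// -100..100, so near-equal moves collapse into the same value and
// the random tie-break below fires more often.
func bucket(score int) int { return score / 100 }

// requestMove shuffles the legal moves, then keeps the first move
// with the best bucketed score. Because of the shuffle, ties among
// equal moves are broken differently from game to game.
func requestMove(moves []int, negamax func(move int) int) int {
	rand.Shuffle(len(moves), func(i, j int) {
		moves[i], moves[j] = moves[j], moves[i]
	})
	best, bestScore := moves[0], bucket(negamax(moves[0]))
	for _, m := range moves[1:] {
		if s := bucket(negamax(m)); s > bestScore {
			best, bestScore = m, s
		}
	}
	return best
}
```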
I assume 5-move openings means trying all combinations of the first 5 moves. One problem I can see is that you might be exploring very bad positions that would never arise in an actual game, which could make the results noisy if these bad scenarios are the majority of your test games.
- Created Sunday, July 24th 2016 @ 07:07:57
@ghooo My bot uses an "anytime" algorithm, and can be limited by the number of 'rounds' in that algorithm. When I do this, I also disable time tracking in my evaluator, because I run it on a VPS, which doesn't always have consistent performance.
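Something along these lines, as a sketch (step is a stand-in for one round of an anytime algorithm, e.g. one MCTS iteration; the names are mine):

```go
package search

import "time"

// Budget caps the search either by wall clock or by a fixed number
// of rounds. Using Rounds (and ignoring the deadline) makes results
// reproducible on hardware with inconsistent performance, e.g. a VPS.
type Budget struct {
	Deadline time.Time // used when Rounds == 0
	Rounds   int       // fixed work per move; 0 means "use the clock"
}

// run executes one round of the anytime algorithm per iteration and
// stops when the budget is exhausted.
func run(b Budget, step func()) {
	for i := 0; ; i++ {
		if b.Rounds > 0 {
			if i >= b.Rounds {
				return
			}
		} else if time.Now().After(b.Deadline) {
			return
		}
		step()
	}
}
```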
Early on, I used micro benchmarks to test small changes to my code, though they became irrelevant fairly quickly. With my macro benchmarks (which play out the first several moves back and forth), I do two things. One, I keep track of the time it takes to complete the operation (several seconds) and make sure it is going down. Two, I use the CPU profile built while running the benchmarks to find hot spots. When I was only using micro benchmarks (on a single function), I would often make changes that sped up one function at the cost of another. The larger view (with profiling) has helped me prioritize better. Golang comes with built-in benchmarking and profiling support: https://blog.golang.org/profiling-go-programs
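In Go that looks roughly like the following, as a sketch (a macro benchmark in a `_test.go` file; playOpening is a placeholder for the real engine code):

```go
package bot

import "testing"

// playOpening is a placeholder for the real code that plays the
// first several moves back and forth between two bots.
func playOpening() {}

// BenchmarkOpening times the whole opening sequence rather than a
// single function, so speedups in one spot can't hide regressions
// in another.
func BenchmarkOpening(b *testing.B) {
	for i := 0; i < b.N; i++ {
		playOpening()
	}
}
```

Running `go test -bench=Opening -cpuprofile=cpu.prof` and then `go tool pprof cpu.prof` gives the larger profiling view described above.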