Dec 7, 2011

[Info] Silliness again

I've been writing a paper for an evening course I've been taking, related to chess engine searches.

To get a clean output of the search I had to turn off most of the features, like killer moves, PVS search, aspiration windows etc.

Funny thing: when I was going to turn off the futility pruning, I noticed it was already turned off... :) Apparently I had accidentally returned false for "use futility pruning" even when the requirements were met.

That means Mediocre v0.4 is playing without it.

I ran a quick 128 game test and turning it on seems to gain some 30-40 elo points in self play. Not too huge, but definitely silly to not have.

I'll be sure to turn it on again in the next release. :)

Nov 29, 2011

[Info] Jim Ablett's compile of Mediocre v0.4

Jim has compiled Mediocre v0.4 and I added it to my sourceforge page.

I haven't had time to test it myself, but previous experience suggests that Jim's compiles are far stronger than the Java version, so I'd recommend using it.

Jim's page

Mediocre v0.4 JA compile

Nov 27, 2011

[New Version] v0.4 - Ponder, revamped search, UCI only

Changes:
  • Any hash move used is now verified; this fixes a very rare occurrence of Mediocre crashing
  • The transposition table now uses the full 64-bit zobrist keys
  • The search was completely rewritten, possibly catching some bugs. This should help quite a bit with playing strength
  • Ponder implemented
  • Removed the dependency on a settings file; things like hash size are now set through the UCI protocol
  • Removed the semi-working xboard protocol entirely. Sorry.

Note: This version is notably stronger than version 0.34, mainly due to bugfixes in the search.

Mediocre is, as mentioned, a UCI-only engine from here on. This also means I've removed the old settings file; use the UCI settings commands mentioned in the readme file.

Download here

Nov 25, 2011

[Info] Testing results

So some testing to confirm I didn't do anything silly.

M1-1 is a version with 64-bit zobrist keys in the transposition table, removal of the notion of "row", and some evaluation fixes, but without the tapered eval (see previous posts for more info).

Against the Mediocre v1.0 beta it turned out like this:


  Program        Elo    +    -  Games   Score    Av.Op.  Draws
1 M1-1       :  2401    6    6  11029   50.4 %    2399   24.5 %
2 M1B        :  2399    6    6  11029   49.6 %    2401   24.5 %


So pretty much equal, which is good enough. The worst scenario here would be the beta version being slightly stronger, but that would be by at most a few elo points.

And against some other engines just to confirm.


  Program        Elo    +    -  Games   Score    Av.Op.  Draws
1 counter    :  2593   15   15   2048   76.4 %    2389   23.4 %
2 M1-1       :  2392    8    8   6154   48.2 %    2405   14.4 %
3 adam       :  2337   14   15   2048   42.5 %    2389    9.3 %
4 bikjump    :  2294   15   15   2048   36.6 %    2389   10.6 %

  Program        Elo    +    -  Games   Score    Av.Op.  Draws
1 counter    :  2580   14   14   2048   75.1 %    2388   25.4 %
2 M1B        :  2390    8    8   5854   47.5 %    2407   15.8 %
3 adam       :  2343   14   14   2048   43.7 %    2388    9.3 %
4 bikjump    :  2290   16   16   1748   36.4 %    2388   12.1 %

The newer version seems to be holding up.

I'll release a new version with this during the weekend, probably on Sunday.

Then I have a steady foundation to start tackling the evaluation again.

Nov 23, 2011

[Info] So wrong again, but at least closer

So yeah, my imagined strength increase mentioned in the last post was non-existent of course.

But, the tapered eval seems to be holding up as the culprit of my recent failures.

I've tried to zero in on the exact version after Mediocre v1.0 Beta that did the best, with all kinds of combinations with and without 64-bit hash tables, tapered eval and removal of the notion of "row".

The results are... inconclusive.

However, a version with everything except the specific addition of tapered eval seems to be playing at least equal with the beta version. So I think I'll just go with that one, do a new release (to get a firm base to build from), and then start with my evaluation tampering.

I'll post some testing results in a day or two. (not going to leave any doubt this time)

Nov 18, 2011

[Info] Importance of thorough testing

Lately I've been struggling with one of those "super versions" that seems to beat everything I throw at it.

When I got done with my search improvements I did some really extensive testing against Mediocre v0.34 and concluded the new version to have pretty much exactly 60% win rate against it.

So I tagged that version and called it Mediocre v1.0 beta.

Then I committed three things to the trunk of svn: renaming of row to file, tapered eval and the change from 32 bit to 64 bit keys in the transposition table (along with a sanity check of all tt moves).

I thought I'd tried all of these extensively, scoring more or less equal to v1.0 beta, which I deemed ok since the changes were more or less needed for readability, stability, and future work.

During the past weeks any change I made, no matter how tiny it seemed, got slaughtered by 1.0 beta. All my evaluation tweaking seemed to give results, but against 1.0 beta it still lost.

Now, the newer (uncommitted) versions had some utility changes that I really wanted to have committed (things like the mirror evaluation test). So I took those changes and added them to the 1.0 beta tag one by one, testing quite extensively between every change.

After I'd moved over all the utility, I thought I might just as well try the three things I'd committed after 1.0 beta. This is how that testing went:

  1. Row to file change: This should just have been a readability change (the usage of "row" had lingered around since the very first version of Mediocre, while the correct terminology is of course "file"). But it turned out that while doing this I'd also changed the rank, file and distance methods to static (rather than instance methods). This seems to be a very good move since they're called a lot, and suddenly 1.0 beta was playing better, quite a bit better. (A small sketch of such static helpers follows after this list.)

  2. 32 to 64 keys and hash move validation: I thought if anything, this would be the culprit since messing around with the transposition tables is very likely to introduce bugs. Now when re-adding it, it seems to give a tiny but noticeable strength increase..

  3. Tapered eval: Horrible horrible reduction in strength. I have no idea how I missed this, but it seems to completely ruin the evaluation. Here's the actual culprit and I'll be much more careful when trying to put it back.
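
For reference, a minimal sketch of what such static helpers can look like (this assumes plain 0-63 square indexing; Mediocre's actual board representation and method names may differ):

public final class BoardUtil {
    private BoardUtil() {}

    // 0-based rank and file of a square index 0..63 (a1 = 0, h8 = 63)
    public static int rank(int square) { return square >>> 3; }
    public static int file(int square) { return square & 7; }

    // Chebyshev distance: the number of king moves between two squares
    public static int distance(int a, int b) {
        return Math.max(Math.abs(rank(a) - rank(b)),
                        Math.abs(file(a) - file(b)));
    }
}

Being static, they can be called directly without an object reference, which matters for methods that get called millions of times per search.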


So, the moral of the story: never assume you did enough testing if you see signs that you didn't.

Nov 14, 2011

[Tournament] GECCO - Final results


1 Spike wwbwbw xrtnbd 111==1 5
2 Nightmare wbwbbw ctgsrb 1=1=1= 4.5
3 Tornado bwwbbw bnsdgm 1=0111 4.5
4 Rookie -bwbwb msdbnc 101=01 3.5
5 Baron wbbwwb tgmrsn 0=1=== 3
6 Goldbar wwbbwb dbnctx ==0101 3
7 Deuterium bwbwwb gxrtcs =10010 2.5
8 Mediocre -bw-bb rcbxxt 010010 2
9 Spartacus bwbwbw nmxgdr 001000 1
10 micro-Max bbw-ww sdcmmg 000100 1

Not what I'd hoped for, but with two forfeits I guess that's what I deserve. At least Mediocre won the two games it should have, and played very well against The Baron, while pretty horribly against Tornado.

Next time Mediocre will be in the top half. :)

[Tournament] GECCO - Game 6

A bit unlucky with the pairing, getting Tornado here. Mediocre had the bishop pair and felt quite comfortable, but underestimated the insanely strong white knight that ultimately led to an unstoppable pair of passed pawns. Not much to say about this loss, Tornado was just better.

[Tournament] GECCO - Game 5

A second chance against micromax. It started out a bit crazy and then turned into an endgame where Mediocre had the upper hand from the start.

[Tournament] GECCO - Game 4

Forfeit against micromax... yeah, I overslept (and was a bit hungover after a late Saturday night...). I was connected to the server, but for some reason Mediocre couldn't start the game. No idea why.

Nov 12, 2011

[Tournament] GECCO - Standings day 1

   Name          Rating  Score  Perfrm  Upset   Results
   ------------  ------  -----  ------  ------  ---------------
 1 +Spike        [1872]   3.0   [2168]  [  10]  +10w +03w +04b
 2 +Nightmare    [1833]   2.5   [2060]  [  24]  +08w =04b +07w
 3 +Rookie       [1747]   2.0   [1874]  [   0]  +09w -01b +06w
 4 -Tornado      [1882]   1.5   [1793]  [   0]  +05b =02w -01w
 5 +Baron        [   0]   1.5   [1793]  [2587]  -04w =07b +09b
 6 -Deuterium    [   0]   1.5   [1748]  [2587]  =07b +10w -03b
 7 -Goldbar      [1824]   1.0   [1594]  [   0]  =06w =05w -02b
 8 -Spartacus    [   0]   1.0   [1594]  [1675]  -02b -09w +10b
 9 +Mediocre     [   0]   1.0   [1565]  [1675]  -03b +08b -05w
10 -microMax     [   0]   0.0   [1340]  [   0]  -01b -06b -08w

Mediocre's walkover was against Rookie which I really thought I had a chance against. Too bad.

I guess MicroMax should be possible to beat and then we'll see what the other opponents are. Looking at the board it would be Deuterium and Goldbar. With some luck perhaps a 4.0 score isn't too impossible.

We'll see tomorrow.

[Tournament] GECCO - Game 3

Game 3 underway against The Baron. Have no high hopes for this one. :)

-

A solid loss as expected, but Mediocre played quite well, I'd have to say. It ended up with some overextended pawns and the kings on the wrong side (The Baron had a pawn majority on the queenside, making the pawn ending a really simple win).



Game 4 starts tomorrow at 8:30 CET.

[Tournament] GECCO - Game 2

Spartacus played weirdly in the endgame but still almost held the draw due to opposite-colored bishops.

I have only a 20% adjustment towards a draw for opposite-colored bishops.. might be slightly too little, but rather too little than too much I guess.
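
In code the adjustment amounts to something like this (a hedged sketch; the flag and the exact scaling here are illustrative, not the actual implementation):

// Pull the score 20% toward a draw when opposite-colored bishops are on the board
static int scaleForOppositeBishops(int eval, boolean oppositeColoredBishops) {
    return oppositeColoredBishops ? eval * 80 / 100 : eval;
}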


[Tournament] Mediocre in GECCO

Mediocre is participating in a long time control tournament today and tomorrow.

http://marcelk.net/chess/GECCO/2011/GECCC.html

Unfortunately I had connection issues during the first game and had to forfeit it. The second game, against Spartacus, is underway now and everything seems to be going fine; 14 moves in and Mediocre says it's up +1.75. :)

I'm using a few weeks old version of Mediocre, with the changes to search but none of the recent evaluation dabbling.

Nov 7, 2011

[Info] Yay me

Up to my 10th failed attempt at tuning my passed pawn eval.

On the last attempt I wasted 20,000 games.

I have tables looking like this:

Rank: 1 2 3 4 5 6 7 8
Value: {0,10,20,30,60,120,150,0}

That is increasingly higher evaluation the closer the passer is to promotion.

This table can then be stretched in all kinds of directions during the tuning (increasing/decreasing all values, or increasing the differences between them) using two "knobs", so the table only needs two values to tune instead of six.
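
One possible two-knob parameterization is sketched below (purely illustrative; the real knobs may be wired up differently): one knob scales the whole table, the other exaggerates or flattens the differences between the ranks.

static int[] passedPawnTable(double scale, double spread) {
    int[] base = {0, 10, 20, 30, 60, 120, 150, 0};
    int[] table = new int[8];
    for (int rank = 1; rank <= 6; rank++) {
        // spread > 1 increases the differences between ranks, spread < 1 flattens them
        table[rank] = (int) Math.round(scale * Math.pow(base[rank], spread));
    }
    return table; // first and last entries stay 0
}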

Now, I had reversed the values when preparing for the tuning... so instead of giving 150 centipawns for being one square from queening I gave it 10.

The tuning tried to compensate and the best it came up with was:

Rank: 1 2 3 4 5 6 7 8
Value: {0,-83,-60,-14,-9,17,25,0}

Quite a good effort, but I find it hard to believe an 8 cp difference between the 6th and 7th ranks is optimal.

Fixed the problem and running the tuning for the 11th time. :)

Nov 6, 2011

[Info] Fun little problem

Still tuning; I've moved on to passed pawn eval, which is another of those problem areas (I've always thought Mediocre neglected passed pawns, but any attempt to manually tune them has resulted in heavily overvaluing them).

While running my tests I ran into an interesting little problem. Since I have aspiration windows it's quite common that re-searches occur (when the result is outside the window, i.e. a sudden drop/rise in evaluation between iterations).

Now, if you're not careful it's possible that this window bounces back and forth, i.e. failing too low, then too high, and never getting past the iteration since it keeps re-searching.

I have all those safety measures in place, but with the extreme evaluation numbers the tuning can come up with, the score was so high it surpassed the "infinity" score. Now this will obviously fix itself after a few iterations of the testing (giving a passed pawn the value of 20 queens is probably not going to help), but my aspiration windows went berserk since I check it like this:

if (eval <= alpha) {
    // failed low: the score dropped below the window, re-search with a wider lower bound
    ...
} else if (eval >= beta) {
    // failed high: the score rose above the window, re-search with a wider upper bound
    ...
}

With alpha and beta set to minus and plus "infinity" we have the maximum window, which should never cause a re-search (a mate score is obviously lower than infinity). But as I said, with these extreme evaluation parameters it did.

Easy fix, just a bit silly and quite hard to find.

Nov 5, 2011

[Info] CLOP windows executable

Rémi Coulom was nice enough to create a Windows executable of CLOP (I ran it through Qt Creator before, which was a bit silly).

If you haven't tried his software before I urge you to do so, it's quite awesome:

http://remi.coulom.free.fr/CLOP/

Some changes in this version as well:

2011-11-05: 0.0.9
  • Stronger regularization (avoid overfitting in high dimensions)
  • "Merge Replications" option in gui -> faster, better display
  • Performance optimization of display and loading of large data files
  • Removed "-ansi" option for Windows compilation
  • Shrinking parameter ranges does not lose integer data any more
  • Removed confusing columns: max-1, max-2, ...
  • More explanations in the doc: biased win rate + GammaParameter

Link to the release post here.

[Info] The results are in

So I've run the mobility tuning overnight, with more than 10,000 games (probably too little for extreme accuracy, but good enough for me).

All four parameters (mentioned in my last post) started to zero in on their values very well. For example, the first parameter looked like this:

[CLOP plot omitted]

The mean value was -370, which can clearly be seen in the plot.

So on to all of the results and some comparisons (sanity checks I guess). These were the resulting values (meaning of these explained in my last post):

  • MOBILITY_SAFE_MULTI = -370

  • MOBILITY_UNSAFE_MULTI = 783

  • MOBILITY_ONE_TRAPPED_MULTI = 767

  • MOBILITY_ZERO_TRAPPED_MULTI = 1343


I'll do four different examples.

  1. Average piece - a piece with 4 safe squares and 1 unsafe

  2. Good piece - a piece with 12 safe squares and 3 unsafe

  3. Bad piece - a piece on the fourth rank with 1 safe square and 1 unsafe

  4. Trapped piece - a piece on the fourth rank with 0 safe squares and 1 unsafe


So comparisons (using the examples above):

  1. 4*2+5 = 13 (before tuning)
    -370*4/100+783*5/100 = 25 (after tuning)

  2. 12*2+15 = 39
    -370*12/100+783*15/100 = 73

  3. 1*2+1-4*5/2 = -7
    -370*1/100+783*2/100-767*4/100 = -18

  4. 0*2+1-4*5 = -19
    -370*0/100+783*1/100-1343*4/100 = -46


So very reasonable numbers, just a bit higher in all directions just as I suspected (always nice when you have a guess and testing confirms it).

First interesting thing is the negative safe square number. I'm sure it's not trying to penalize safe squares, but rather put all bonus in the total squares (as those include safe squares as well), meaning safe vs unsafe squares is really not that important.

The second thing is my apparently good guesstimate of giving a piece with one square half the penalty of a piece with zero squares, which the tuning seems to confirm by almost doubling the factor (767 vs 1343).

-

Time to run a test with all these new values. Will be very interesting.

Nov 4, 2011

[Info] The mobility tuning

I think the results of the mobility tuning are going to be quite interesting (and hopefully useful), so I'm going to scribble down the specifics of the test.

I'm going to test the following parameters:

MOBILITY_SAFE_MULTI
Every safe square that a piece can reach on the board, times this constant, divided by 100.

So basically:

MOBILITY_SAFE_MULTI = 500
Safe squares = 4
Equals 500*4/100 = 20 centipawns bonus

MOBILITY_UNSAFE_MULTI
Every safe square, plus every unsafe square (i.e. squares that are protected by lesser-valued pieces). Same equation as above.

MOBILITY_ONE_TRAPPED_MULTI
If the piece has exactly one safe square it's penalized by the rank it's on plus 1 (weighted as above).

MOBILITY_ZERO_TRAPPED_MULTI
If the piece has no safe squares it's penalized by the rank it's on plus 1 (weighted as above).
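
Put together, the four multipliers combine roughly like this (a simplified sketch, not the actual code; relativeRank is meant as the 0-based rank from the piece owner's point of view):

static int mobilityTerm(int safeSquares, int unsafeSquares, int relativeRank,
                        int safeMulti, int unsafeMulti,
                        int oneTrappedMulti, int zeroTrappedMulti) {
    int score = safeMulti * safeSquares / 100
              + unsafeMulti * (safeSquares + unsafeSquares) / 100;

    if (safeSquares == 1) {
        score -= oneTrappedMulti * (relativeRank + 1) / 100;   // nearly trapped piece
    } else if (safeSquares == 0) {
        score -= zeroTrappedMulti * (relativeRank + 1) / 100;  // completely trapped piece
    }
    return score;
}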


After 1100 games I got the following numbers

MOBILITY_SAFE_MULTI = -141
MOBILITY_UNSAFE_MULTI = 865
MOBILITY_ONE_TRAPPED_MULTI = -174
MOBILITY_ZERO_TRAPPED_MULTI = -133

Not too happy about those negative numbers, but this would result in the following scoring for, say, a bishop with 4 safe and 1 unsafe squares (quite a reasonable assumption):

-141*4/100 + 865*5/100 = 39

So basically it's favoring the unsafe squares and trying to reduce the safe-square evaluation. (I wonder if this indicates that the total number of squares is what matters, rather than the safe/unsafe distinction.)

I'll leave it overnight and see what it comes up with.

[Plan] Tuning

I've started some tuning with CLOP, which is an excellent piece of software. Of course I'd want software completely focused on chess engine tuning, i.e. choose parameters, choose opponents, and receive optimal parameters and expected gain. But this is very much good enough.

First I tuned the futility levels, which resulted in a quite expected (but hard to guess) increase in the values for the shallow nodes. From:

120, 120, 310, 310, 400

to

210, 230, 260, 260, 460

So basically a higher margin before skipping searching nodes close to the leaves (remember futility pruning checks how far behind you are in the search and if the arbitrary margin doesn't get you back on the plus side, simply do not keep searching). And a slightly lower margin in nodes further from the leaves, and then slightly higher again for the nodes 5 plies from the leaves.
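
As a reminder, the check itself looks roughly like this (a sketch with the newly tuned margins indexed by remaining depth; the surrounding conditions are the usual ones and not necessarily exactly what's in Mediocre):

static final int[] FUTILITY_MARGIN = {0, 210, 230, 260, 260, 460};

static boolean futile(int depth, int staticEval, int alpha,
                      boolean inCheck, boolean tacticalMove) {
    // if even a generous margin can't lift the static eval back to alpha, skip the node
    return depth >= 1 && depth <= 5
        && !inCheck && !tacticalMove
        && staticEval + FUTILITY_MARGIN[depth] <= alpha;
}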

Pretty much what I expected (I borrowed the previous values from Crafty and generally thought they were a bit optimistic).

The next thing I tuned was the king positioning table in the endgame, which looked like this:

-20 -15 -10 -10 -10 -10 -15 -20
-15 -5 0 0 0 0 -5 -15
-10 0 5 5 5 5 0 -10
-10 0 5 10 10 5 0 -10
-10 0 5 10 10 5 0 -10
-10 0 5 5 5 5 0 -10
-15 -5 0 0 0 0 -5 -15
-20 -15 -10 -10 -10 -10 -15 -20

Again this was just randomly chosen (in this case I think I simply went with gut-feeling to pick the numbers). And I always suspected them to be too low, that is not giving enough credit for having the king in the center in the endgames.

Tuning gave this:

-187 -157 -128 -128 -128 -128 -157 -187
-157 -99 -70 -70 -70 -70 -99 -157
-128 -70 -41 -41 -41 -41 -70 -128
-128 -70 -41 -12 -12 -41 -70 -128
-128 -70 -41 -12 -12 -41 -70 -128
-128 -70 -41 -41 -41 -41 -70 -128
-157 -99 -70 -70 -70 -70 -99 -157
-187 -157 -128 -128 -128 -128 -157 -187

So a whole bunch lower, a bit surprising... But bigger differences between center and edges (two pawns' difference). I did a quick test of this, and over 200 games it seems it's certainly better (about a 60% win rate over the old values).

I have a feeling my evaluation is so badly tuned that I'll be seeing a lot of these quite extreme numbers, and I might have to pass through all the variables a few times until I get them all right.

But I love this tuning business (which I've done very little of in the past). Simply pass in a few parameters, wait a couple of hours, and out comes an improved engine. No effort whatsoever. Silly really. :)

Next thing up is mobility which I have no idea how valid it is at the moment. Currently I do something like count the available squares for a piece, and give twice the number in centipawns along with half the number of unsafe squares (protected by lesser valued pieces).

This gives a really arbitrary number which I have no idea how good it is. Will be really interesting what CLOP comes up with.

Oct 26, 2011

[Info] Profiling

As a response to John's question in a comment in the last post, here's a breakdown of where Mediocre spends its time. Sorted on time spent in each method (based on a 30 second snapshot).

[profiler screenshot omitted: exclusive time per method]

Keep in mind that this does not include subsequent method calls; it's just the time spent inside the actual method. Including subsequent calls, the evaluation would look like this:

[profiler screenshot omitted: inclusive time per method]

Basically more than half the time spent in evaluation, and a third of that is generating attacks for the different pieces.

Also high in the list is the Board.isAttacked-method, due to "is in check" calls in every node in the search, i.e. isAttacked is mostly used to see if the king is in check.

I do think the things related to making and generating moves are in a good place, around 15% of the total time. Maybe slightly high but not too bad.

If I used bitboards instead, with the same (or better) performance in generating and making/unmaking moves, I would gain an obscene speed upgrade for "free", as the attack checks cost a fraction of what they do in my current setup.

Oct 25, 2011

[Plan] Now I remember

Now I remember why I took a break last time. Annoying piece of ... changes that never improve a thing.

Everything that isn't a completely obvious bugfix (and even some of those) seems to hurt the performance.

So I'm trying to convince myself that I should rewrite the board representation and start using bitboards.

One of the ways (to convince myself) was to shut down everything that had something to do with attacks in the evaluation (which is extremely expensive without bitboards). This means most of the king safety, hanging pieces and mobility, and big parts of pawn evaluation.

Basically a huge part of the evaluation just removed. And the result versus my latest (non-handicapped) version was:
Score of M1B vs M1-1: 529 - 316 - 179  [0.60] 1024

Where M1B was the non-handicapped.

So basically +70 elo points (and probably less if you account for the self-play inflation), for 50% of the evaluation.

Stupid.

I'm sure this isn't completely true, the speed gain is probably worth more in self-play than against other engines (that are already faster).

However, I'll spend some time playing around with bitboards and see where I end up.

Oct 23, 2011

[Info] Slight improvement

It seems the bug fixes I mentioned in my previous post have given a slight but measurable improvement:

Score of M1-1 vs M1B: 420 - 366 - 238 [0.53] 1024

(M1-1 being the newest version with the fixes)

Fortunate, since I hadn't committed changes to the SVN repository for a while and it was starting to be hard to see exactly what had changed.

This run includes the tapered eval change. I did a 1024-game test run with only that, and it turned out to be exactly 50/50, that is, no change in strength whatsoever. I had expected a slight improvement from it, but I guess my eval isn't really tuned for it. But no change is just fine, since I now have it in there and can start building from it.

I've also added a whole bunch of testing features and rewritten my "console mode" quite a bit. For example it's now possible to define a file with fen-positions and do an "evalmirror" on all of them to spot bugs. Might not be so fun for the average user, but extremely useful for me.

Before I release 1.0 I'll make it possible to run testsets in the epd format, directly from within Mediocre.

-

I've tried desperately to improve on the passed pawn evaluation. As it stands it's not very good and I suspect there are a lot of improvements that can be done.

But all my attempts so far have been in vain. I've tried Ed Schröder's way of evaluating them, and also Stockfish's and Crafty's, and some combinations of them coupled with my own ideas.

But everything I do seems to severely hurt the playing strength. And by severely I mean things like 100 rating points worse or so.

I'll keep trying though, I'm sure I'll hit the magic implementation at some point.

[Info] Soo many bugs

I stumbled across a nice test for the evaluation method.

Take a position and evaluate it, then mirror the position (switching all pieces from left to right) and evaluate again, then reflect the position (from front to back, and invert the colors, and switch side to move) and finally reflect and mirror it.

This gives four different positions that should give exactly the same evaluation, assuming a symmetric evaluation.
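
A sketch of the test (everything here is hypothetical scaffolding rather than the real code; with a side-to-move-relative evaluation all four positions must return the same score):

interface Evaluator { int evaluate(String fen); }
interface Transform { String apply(String fen); }

static boolean evalIsSymmetric(String fen, Evaluator eval,
                               Transform mirror, Transform reflect) {
    int e = eval.evaluate(fen);                                   // original position
    return eval.evaluate(mirror.apply(fen)) == e                  // mirrored left/right
        && eval.evaluate(reflect.apply(fen)) == e                 // reflected, colors and side to move flipped
        && eval.evaluate(reflect.apply(mirror.apply(fen))) == e;  // both transforms
}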

So far I've found around 10 bugs, and there are probably more to come. Some of them were just symmetric flaws (like giving the king better evaluation if it's on C1 rather than F1, which makes sense on some level but really is not that good).

But some of them ranged from minor to pretty severe. For example, I noticed that my weak pawn evaluation was, first of all, checking whether the pawn could move backwards(!) to be supported. And also that I, for some infernal reason, checked the absolute value of the piece when checking for blocked pawns, which in some positions made white not care about the pawn being blocked (unless it was doubled), while black did.

Another one was evaluating fianchetto for black by checking if the pawn in front of the king was on the third rank instead of the fifth.

I'm not sure how much these bugs actually matter in terms of playing strength (sure rewarding fianchetto when pawn is on third rank might be wrong, but a bishop in front of the king in those positions might not be so bad after all).

I'll put something together to explain this excellent test a bit more thoroughly.

Oct 19, 2011

[Guide] Tapered Eval

As I just mentioned here, game phase interpolation (or tapered eval) is a way of avoiding sharp differences between evaluations in different phases of the game.

The following explanation is based on how it's done in Fruit, and there are similar implementations in Crafty and Stockfish (and probably many many other engines as well).

The scaling looks like this:

eval = ((opening * (256 - phase)) + (endgame * phase)) / 256

Where opening is the evaluation of the position with middle game in mind (e.g. keep kings protected behind their pawn covers) and endgame is the evaluation with endgame in mind (e.g. activate the kings). Both these evaluations are done in parallel when evaluating a position.

The phase is evaluated like this (code specifics left out):
PawnPhase = 0
KnightPhase = 1
BishopPhase = 1
RookPhase = 2
QueenPhase = 4
TotalPhase = PawnPhase*16 + KnightPhase*4 +
BishopPhase*4 + RookPhase*4 + QueenPhase*2

phase = TotalPhase

phase -= wp * PawnPhase // Number of white pawns
phase -= wn * KnightPhase // White knights
...
phase -= br * RookPhase
phase -= bq * QueenPhase

phase = (phase * 256 + (TotalPhase / 2)) / TotalPhase

Example
White and black each have two knights, a bishop and a rook (NNBR), and the evaluation for the opening is +100 and for the endgame +300.

According to the above numbers we then get:

phase = (14 * 256 + (24 / 2)) / 24 = 149
## Where 14 is 24-1-1-1-2-1-1-1-2 (TotalPhase - phases of all pieces)

eval = ((100 * (256 - 149)) + (300 * 149)) / 256 = 216 tapered eval
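
For completeness, a small runnable sketch of the same interpolation (following the Fruit scheme above; the piece counts are totals for both sides, and pawns are left out since their phase value is 0):

static int taperedEval(int opening, int endgame,
                       int knights, int bishops, int rooks, int queens) {
    final int KNIGHT = 1, BISHOP = 1, ROOK = 2, QUEEN = 4;
    final int TOTAL = 4 * KNIGHT + 4 * BISHOP + 4 * ROOK + 2 * QUEEN; // 24

    int phase = TOTAL - knights * KNIGHT - bishops * BISHOP
                      - rooks * ROOK - queens * QUEEN;
    if (phase < 0) phase = 0;                    // guard against promoted material
    phase = (phase * 256 + TOTAL / 2) / TOTAL;

    return (opening * (256 - phase) + endgame * phase) / 256;
}

Plugging in the example (four knights, two bishops and two rooks in total): taperedEval(100, 300, 4, 2, 2, 0) returns 216, matching the calculation above.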

Chess programming wiki - Tapered Eval

-

Quite ingenious. Funny how the decision on game phase in the current version of Mediocre is extremely close to this (without interpolating on it, of course). I wonder if I stole that from Fruit way back when I implemented it. :)

[Info] Chessprogramming wiki and me

Since we now have this excellent resource Chessprogramming wiki I've decided to put my efforts on guides there. I.e. anything I write from now on in terms of trying to explain a subject will be put both there and here.

This will probably mean I'll have to try to cut down on my babbling and keep to proven concepts with good source information.

Hopefully Gerd (Isenberg) can keep up with my spamming.. :)

[Plan] Moving on to evaluation

I'm starting to be quite happy with the results of my rewritten search, a bit bigger test gave this:

  Program        Elo    +    -  Games   Score    Av.Op.  Draws
1 M1-1       :  2435   15   15   1508   59.9 %    2365   25.7 %
2 M34        :  2365   15   15   1508   40.1 %    2435   25.7 %

This is comfortably within the bounds for my new engine being superior. The rating difference is probably inflated quite a bit due to this being a self-play test (as Thomas Petzke pointed out), but nonetheless it's an improvement for sure.

So time to move on to the evaluation revamp.

Having looked through it for the first time in a couple years I remember how extremely complicated it is. :) With a lot of Ed Schröder's recommendations in it (not exactly simplifying things).

I honestly don't think a complete rewrite is feasible. The best course of action is probably to prepare it for some auto-tuning and then start playing massive amounts of games.

However, there are a few things that could be looked over before that.

  • Game phase interpolation (or tapered eval) - Seems every other engine has this now and I can clearly see the benefits. Basically don't switch sharply between the different phases of the game, but rather do it smoothly with separate scores for the different phases.

    Simply put, let's say we evaluate as a middle game and get an evaluation of +200 while an evaluation as endgame gives +400 (maybe the king is close to the middle which would give a penalty in middle game but a bonus in endgame). Now if we're clearly in the middle game (queens and rooks and minor pieces all over), we would just take the +200. And if we're clearly in the endgame (maybe one or two pieces) we take the +400. But if we're in the middle of transitioning to the endgame (perhaps we have a few minor pieces but also queens) we shouldn't take one or the other straight off. But rather interpolate between the two. If we decide we're exactly in the middle between middle game and endgame we would take a +300 score.

    I want to get this done first before poking anymore in the evaluation, since I'll probably need to split up a whole bunch of scores.

    I'll get back to this. (maybe a new guide is in order)

  • Endgame knowledge - Currently a KBKP endgame is evaluated heavily in favour of the side with the bishop (somewhere around +200 centipawns), when it clearly should be close to 0. KRKB is evaluated as a clear win for the side with the rook. And there's plenty more of that. Since I have no plans of adding tablebases (and even if I do, I can't assume they'll always be present), I will need to put in some basic endgame knowledge.

  • Passed pawns - Something is not right here. Need to take a good look at how I evaluate them (a simple comparison to Crafty's evaluation gave far too low scores for Mediocre).

  • Tropism - I'm not sure how much is gained (or lost) from this, I just put it in because I could (it's a fairly simple piece of code). Maybe take a look at it and see how others do it.

  • Manual tuning - Find a good way to look at the individual evaluation parameters (even more so than the current one I have) and see if I can spot any obvious flaws. For example I'm quite sure the king's endgame piece square table has too low values (not rewarded enough for moving out the king in the endgame).

  • Auto-tuning - I've had a look at CLOP which seems really nice, and also Joona Kiiski explained how he tunes Stockfish (forum post), which was identified as SPSA but with self-play.

    However I decide to do it, I need to prepare Mediocre's evaluation for it. Which means access to more parameters from outside the evaluation.


So first, tapered eval.

Oct 18, 2011

[Test] Horrible horrible bug (and fever)

Home sick from work, I forced my feverish self to go through a few of the horrendous test tournaments I've run since Sunday (the doomsday of my new version, when everything I had gained was suddenly nullified).

Firstly, I noticed both v0.34 and my new version are quite bad at evaluating pawns. Sure I run the tournaments at very fast time controls (10sec+0.1) and I can't expect them to calculate all the way to queen all the time. But I do think that's where I have the main evaluation weakness. I'll see what I can do about that.

But there were games where my new version's evaluation suddenly dropped from winning to severely losing. On some occasions it was just a bad position that would inevitably lead to doom, but then I found tens of games where it just dropped the queen for nothing, returning a mate score.

Hopeful and confused I started digging. The "mates" were all singular response moves, like checking with the queen next to the king with the only move available being capturing the queen.

The big problem was I couldn't replicate any of the positions. Inserting them and searching would always give a reasonable move.

Suspecting my new check evasion algorithm, I put down traces where the quiescence search returned mate and ran hundreds of test positions, but no luck. So I added traces to every conceivable place where mate could be returned, and then finally I found a position where the main alpha-beta loop would return a mate score even though there existed a legal move that avoided it.

So after some stepping through the code I found the culprit: at some point, for some idiotic reason, I had removed the scoring of the hash move. Since the score isn't really used (hash moves are always searched first) I must have deemed it unnecessary.

But with my new way of storing moves (by ply) it's imperative that every move is given a score, even if it isn't used. The reason is that further searches can stumble upon old scores and act on them. Like setting the hash move score to -10000 to avoid re-searching it, and then not searching the next hash move either since the -10000 score is lingering around.

Which is exactly what happened.
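
In sketch form the principle looks like this (illustrative only, the real Move objects and scores look different): every slot in the reused per-ply array must get a fresh ordering score, so that a stale "already searched" marker can't survive into a later search.

static final int HASH_MOVE_SCORE  = 10000;   // always searched first
static final int ALREADY_SEARCHED = -10000;  // set once the hash move has been tried

static void scoreMoves(int[] moves, int[] scores, int count, int hashMove) {
    for (int i = 0; i < count; i++) {
        // every entry gets a fresh value; 0 stands in for the normal ordering score here
        scores[i] = (moves[i] == hashMove) ? HASH_MOVE_SCORE : 0;
    }
}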

I seem to remember mentioning this not so long ago, and still I managed to fall into the trap again. :)

After having fixed the bug I ran a quick 128 game test to make sure it solved the problem. And being burnt from Sunday's fiasco I ran another one as well. :) The result:

  Program        Elo    +    -  Games   Score    Av.Op.  Draws
1 M1-1       :  2441   37   37    256   61.7 %    2359   28.1 %
2 M34        :  2359   37   37    256   38.3 %    2441   28.1 %

Back in business again! Now I just have to get rid of this stupid cold. :)

Oct 17, 2011

[Test] Rough day

After Saturday's test gauntlet with v0.34 I ran another set with the newest new version (futility pruning being the latest big addition).

And I was completely dumbfounded by the result..
Rank Name              Elo    +    - games score draws
1 Counter 1.2 265 34 33 297 70% 22%
2 Knightx 1.92 140 34 33 297 53% 9%
3 iCE 0.2 124 33 33 297 51% 10%
4 Mediocre v0.34 119 19 19 1107 65% 7%
5 Mediocre 1.0 -2 112 15 15 1857 64% 9%
6 Horizon 4.4 64 34 34 296 44% 7%
7 TJchess 1.01 -8 35 36 295 35% 5%
8 Roce 0.0390 -47 34 36 295 30% 13%
9 Lime 66 -58 36 37 297 30% 5%
10 Adam 3.3 -171 40 43 297 19% 5%
11 Wing 2.0a -252 45 51 296 13% 3%
12 Bikjump 2.01 -290 48 55 297 10% 4%

Mediocre v0.34 back ahead of 1.0?? Having played literally thousands of games the last week I haven't seen any sign of this..

The new version was beating the crap out of 0.34, at 60+% win ratio.

So I ran a few more test sets and v0.34 was neck and neck with any of the newer versions.

Chess programming can be really hard on your self esteem. :)

Re-running a huge (by my standards) gauntlet again, including both v0.34 and v1.0 and we'll see what happens.

I almost suspect I'm using some weird compile or something, but this seems to be the new truth. :(

Oct 15, 2011

[Info] Lazy eval failure

I simply can't get it to work. I tried including pawn eval and king safety in it, which would make it much more accurate (in exchange for plenty of speed, obviously), but no go again.

Perhaps the more accurate lazy eval gives so little speedup that the few times it actually misses something are enough to make it fail overall.

Some people suggest that futility pruning already does a big part of what lazy eval does, which of course makes sense. It's also a check on the inability to reach alpha, even if you were given a piece, but with a full eval, and it then cuts big chunks out of the search tree.

So maybe they're stepping on each other's toes, and if I have to choose between the two, futility pruning has a lot more potential.

I'm leaving it for now, will revisit when I start my evaluation revamp though.

[Test] Gauntlet engine test done

Turned out like this for v0.34:
    Engine         Score     
01: Mediocre v0.34 667,0/1000
02: Counter 1.2 70,0/100
03: Knightx 1.92 67,5/100
04: iCE 0.2 53,0/100
05: Horizon 4.4 42,5/100
06: TJchess 1.01 38,5/100
07: Roce 0.0390 30,5/100
08: Wing 2.0a 13,0/100
09: Adam 3.3 11,5/100
10: Bikjump 2.01 4,0/100
11: Lime 66 2,5/100

At the start I was debating whether I should have 8 or 10 engines, and since Lime and Bikjump are a bit too far behind I think I'll just drop those and go with 8. That would enable me to have 800 games played overnight, with 100 against each engine.

Elo-wise it looks like this:
Rank Name             Elo    +    - games score draws
1 Counter 1.2 330 56 53 111 72% 22%
2 Knightx 1.92 305 60 57 111 66% 5%
3 iCE 0.2 208 55 54 111 55% 9%
4 Mediocre v0.34 168 20 20 1118 65% 7%
5 Horizon 4.4 104 55 56 110 42% 6%
6 TJchess 1.01 94 55 56 110 41% 9%
7 Roce 0.0390 20 56 59 110 32% 9%
8 Wing 2.0a -178 69 82 111 14% 4%
9 Adam 3.3 -179 69 82 111 13% 5%
10 Lime 66 -274 82 105 111 8% 3%
11 Bikjump 2.01 -288 85 111 111 8% 0%

Mediocre is ranked slightly too high in the field. But I'll stick with this (minus Lime and Bikjump), and add higher ranked engines later on.

[Plan] Deadline for next release

I received a mail from Olivier Deville this morning, asking for registration to OpenWar 9th Edition, which starts November 15th.

So now I have my deadline for when Mediocre 1.0 will be released. (and probably a couple days before that)

[Blog] Layout fun

Played around a bit with the layout editor for the blog, it's evolved quite a bit in the last few years. :)

Feels kinda fresh, perhaps a bit too plain, but well, different at least.

Enjoy.

Oct 14, 2011

[Info] Gauntlet test set

After some back and forth testing (couldn't get about half of the engines I tried to work) I've decided on a new set of engines for gauntlet testing. After a very small sample test (100 games gauntlet for Mediocre v0.34) it came out like this:

Rank Name            Rel Elo CCRL Elo
1 Counter 1.2 319 2556
2 iCE 0.2 185 2440
3 TJchess 1.01 138 2403
4 Lime 66 65 2250
5 Knightx 1.92 61 2405
6 Mediocre v0.34 34 -
7 Bikjump 2.01 1 2188
8 Roce 0.0390 -2 1951
9 Horizon 4.4 -38 2452
10 Adam 3.3 -115 2326
11 Wing 2.0a -215 2200

Where "Rel Elo" is result of the test and "CCRL Elo" how they're rated on that list.

Turned out like I wanted, with Mediocre somewhere in the middle, with some far superior and some far inferior engines.

I'll run a more extensive test later to get a good estimated baseline ELO for Mediocre v0.34, so I have something to compare the new version with.

[Info] Time for more testing

Well, not exactly more testing but need to work out a better procedure.

Currently I'm running the tests in Arena. One game at a time (on my quad processor...).

The games are timed at 10sec+0.1sec. With that 1000 games take about 9 hours. Which makes me barely miss the finish before going to work in the morning.

I wanted to use cutechess-cli, which is both faster and enables me to run four games at a time (one for each core of my processor), but I'm having severe problems with the engines timing out; I wonder if it has to do with Java. It's open source, so maybe I can do something about it. We'll see.

Anyway, since I have two time windows where I can run tests, 8 hours at night, and 8 hours during the day (while at work), I should probably figure something out to fit that.

Perhaps have a self-play match first, then run a gauntlet if it turns out a new version is better.

Today's test looked like this:
Rank  Name                 Elo    +    -  games  score  draws
   1  Gaviota-win64-0.84   334   34   30    606    93%     6%
   2  Mediocre 1.0 -2      -60   21   20    600    43%    21%
   3  Mediocre 1.0 -1     -112   20   20    613    35%    22%
   4  Mediocre v0.34      -162   21   21    607    28%    19%

Mediocre -2 is the one with futility pruning, -1 also has lazy eval included. Not so successful it seems.

And Gaviota is beating the living crap out of all versions. Won't allow that for too long. :)

I'll get back on the new testing setup, time to get that stupid cutechess-cli to work somehow.

Oct 13, 2011

[Info] Not so futile attempts

Title refers to an earlier post where I tried to add futility pruning in the past and never really got it to work.

This time though:

Rank  Name               Elo    +    -  games  score  draws
   1  Mediocre 1.0 -2     31   29   28    100    60%    31%
   2  Mediocre 1.0 -1    -31   28   29    100    41%    31%

So in worst case 2 elo better, in best case 60 (the -2 refers to futility pruning added to that version). But somewhere around 30 elo is probably accurate.

Combining this with the test last night would give this:
Rank  Name               Elo    +    -  games  score  draws
   1  Mediocre 1.0 -2     53   39   38    100    60%    31%
   2  Mediocre 1.0 -1    -11   12   12   1100    53%    24%
   3  Mediocre v0.34     -42   13   13   1000    46%    24%

Probably flawed to bits with no statistical significance. But I wonder if I can't say that Mediocre has gained somewhere upwards of almost 100 elo.

Probably not, but considerably better it is.

On to internal iterative deepening.

[Info] Decent

Not as good as I had hoped for (it never is), but still a decent and almost statistically certain improvement.

1: Mediocre 1.0    544,5/1000
2: Mediocre v0.34 455,5/1000

Seems I finally managed to improve on Mediocre v0.34! :)

A few more things left to implement before I'll consider myself done with the search revamp for a while.

After that I'll have a look at the evaluation. I'm quite happy with what's in it, but I'm certain the evaluation parameters are nowhere near optimal (in fact I have a hard time imagining an engine having less optimal parameters, as many are borrowed and manually adjusted from other engines, some are leftovers from early versions of Mediocre, and some were just arbitrarily added).

Anyway, I can now promise that the next version of Mediocre will have a gain in playing strength. Yay! :)

[Info] Another breakthrough, maybe

I spent the evening dabbling with root move ordering. I had a general idea of what I wanted to do. Something like:

  1. Break out the first ply from the alphaBeta-method so it's possible to add some unique logic to it (examining a few open source engines it seems fairly common that this is done incorporated in the main search method, which I don't quite understand since it adds both complexity to the code and a bunch of extra if-statements).

    This also allows us to make sure the best move is always obtainable (as I mentioned some posts ago I suspect there were some losses where the best move was overwritten in the hash).

    And finally output the currently examined move to UCI, which is kinda nice.

  2. Add an extensive move ordering algorithm to the root moves. Doesn't matter how expensive, since it will be done at most 30-50 times, compared to the millions of ordinary nodes.

    Having read quite extensively about it, without actually finding any concrete evidence of the "best" approach I settled for trying a couple alternatives.

    * Do a quiescence search on all moves and order by the values it returned, but bringing the PV move up front if not there already.

    * Count the nodes each move caused and order by that.

    * A combination of the above (quiescence score with some scale for lower plies, and node counts at deeper depths)

I tried all kinds of combinations, having a rather unreliable but good enough set of 50 positions to try it on, which with no modifications took slightly over 2 minutes with a fixed depth of 8.

I was hovering around that result for all my attempts. A second over or a second under. It seemed my search was immune to root ordering, or rather the ordering obtained by just calling the alphaBeta-method directly was good enough and my attempts at improving it were futile.

Finally I decided to test something else. Instead of using quiescence search I used a one-ply ordinary alpha-beta search, combined with the node counts. Obviously this is nothing new; I seem to remember that some years ago this was the way to go, but many have moved on to better things (which didn't work out for me).
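
One way of combining the two (this particular combination is just an illustration, not necessarily exactly what ended up in the engine): sort primarily by the one-ply score and use the node counts as a tie-breaker.

static void orderRootMoves(int[] moves, int[] onePlyScores, long[] nodeCounts) {
    // simple selection sort, descending by (onePlyScore, nodeCount)
    for (int i = 0; i < moves.length; i++) {
        int best = i;
        for (int j = i + 1; j < moves.length; j++) {
            if (onePlyScores[j] > onePlyScores[best]
                    || (onePlyScores[j] == onePlyScores[best]
                        && nodeCounts[j] > nodeCounts[best])) {
                best = j;
            }
        }
        int m = moves[i];        moves[i] = moves[best];               moves[best] = m;
        int s = onePlyScores[i]; onePlyScores[i] = onePlyScores[best]; onePlyScores[best] = s;
        long n = nodeCounts[i];  nodeCounts[i] = nodeCounts[best];     nodeCounts[best] = n;
    }
}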

So I tried it on the test set again. 30 seconds?! Down from over 2 minutes. My first reaction was to look for bugs, but I couldn't find any.

Have a test going again, we'll see where it lands. Looks promising though.

Oct 12, 2011

[Info] A small breakthrough

First I added the history heuristic (and mate checks in quiescence search which shouldn't amount to much).

This seemed to help a tad:

1: Mediocre v0.34  502,5/858
2: Mediocre 1.0 355,5/858

A slight improvement from the last test. I think the history heuristic helps since I don't order my moves for things like moving forward/towards the center.

I used an idea from GreKo where you keep track of the history of all moves made (from/to as usual), and also of the moves that cause a beta cutoff (in GreKo this uses moves that raised the value of alpha, but beta cutoffs worked slightly better for me). Then divide the beta-cutoff history value by the all-move value.

This should make the history data a bit more sane (essentially penalizing it for making the move without causing a cutoff).
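
Roughly like this (sizes and scaling are just for illustration): two from/to tables, one counting every time a move is searched and one counting the beta cutoffs it caused, with the ordering value being the ratio between them.

static final long[][] allCount    = new long[64][64];
static final long[][] cutoffCount = new long[64][64];

static void recordSearched(int from, int to)   { allCount[from][to]++; }
static void recordBetaCutoff(int from, int to) { cutoffCount[from][to]++; }

static int historyScore(int from, int to) {
    long all = allCount[from][to];
    return all == 0 ? 0 : (int) (cutoffCount[from][to] * 100 / all); // 0..100
}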

-

Now the above was all nice, v0.34 went from 62.25% win rate to 58.57%.

But then I decided to try something I've been meaning to for some time. That is cut all moves with a negative SEE-score in the quiescence search. Point being they are extremely unlikely to prove useful for anything, at least that the quiescence search can detect.
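
In sketch form (assuming the SEE score for each capture has already been computed; the real code is structured differently), the capture list simply gets filtered before the quiescence loop:

// Keeps only captures with SEE >= 0 and returns the new number of moves
static int dropLosingCaptures(int[] captures, int[] seeScores, int count) {
    int kept = 0;
    for (int i = 0; i < count; i++) {
        if (seeScores[i] >= 0) {
            captures[kept]  = captures[i];
            seeScores[kept] = seeScores[i];
            kept++;
        }
    }
    return kept;
}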

And this happened:
1: Mediocre 1.0    352,0/700
2: Mediocre v0.34 348,0/700

Close to dead even score, and most likely enough games to confirm that neither engine would beat the other on a regular basis.

So basically I'm back to the strength of v0.34. And I still haven't even done basic root ordering, which I'm expecting to show significant gains.

Fun fun.

Oct 11, 2011

[Info] Another test, with killer moves

Killer moves implemented with a reasonable gain:
1: Mediocre v0.34 622,5/1000
2: Mediocre 1.0 377,5/1000

Still some ways to go, but certainly getting closer.

I have some suspicion the new version is forfeiting some games due to not returning a best move. I have a safety measure for this in v0.34, but since the root search isn't in place in v1.0 yet, it's missing there.

There is also something fishy with mate scores in the new version, will have to look into that.

I'm hoping root move ordering will bring v1.0 up to par with v0.34. An even score at that stage would certainly be thrilling, as there are still extensions, futility pruning and internal iterative deepening left to do.

Oct 10, 2011

[Info] A little test

Getting closer to v0.34:
1: Mediocre v0.34  711,0/1000 
2: Mediocre 1.0 289,0/1000

What's left to implement is internal iterative deepening, killer moves, extensions and a more elaborate root move ordering. I should try history moves for ordering as well, though from what I've read they seem to have little relevance anymore. But trying doesn't hurt.

Oct 9, 2011

[Info] Starting on basic move ordering

I've spent the last day trying to identify a bug that made Mediocre go haywire (dropping pieces, missing mates in one etc.) when the hash move was used for ordering, which is of course really strange since it shouldn't add or remove any moves, just search them in a different order.

Of course it turned out that one of the few new things I've added was the culprit.

Instead of keeping two integer arrays, one for the moves and one for the ordering values, I've mashed them together into a Move-object. Pretty much only for clarity (it shouldn't cause any performance degradation).

Since I didn't use any move ordering at all, the ordering part of the Move object was left untouched. That is, until I started to use hash moves, which I score as 10,000 (search first in any circumstance) and then -10,000 to mark them as already used.

As no other moves received a score, that -10,000 was lingering in the move array and eventually caused no moves to be searched. :)

-

Well, that's over and done with and using the hash move for ordering gave the following result in my standard mini-test:

1: Mediocre 1.0+  14,0/20
2: Mediocre 1.0 6,0/20

Barely any statistical relevance really, but I'm happy.

Steady going, on to more move ordering.

Oct 8, 2011

[Info] Check evasions

I'm taking baby steps and trying to get everything right from the beginning.

Having considered Jon Dart's suggestion and examined a few open source engines (GreKo http://greko.110mb.com/ is a favorite) I decided to add the notion of not standing pat in the quiescence search if in check.

That required continuing the search for one more step when in check, and hence a need for check evasions.

My new check evasion code takes about a fourth of the time compared to the old one, which seems nice. And there might still be a few more optimizations to do.

The last bug I squashed concerning it was evading a check by capturing on the last rank and promoting to a non-queen.

How I've missed these ultra-specific bugs... :)

Oct 7, 2011

[Info] A new approach

After having spent about 20 hours during the last week trying to tweak Mediocre into becoming at least a tiny bit better, I suddenly had an epiphany and gave up.

It's been far too long since I last worked with this and there are so many things going on that I have no idea if they help or hurt.

So I scrapped the entire search function and started over...

Currently Mediocre 1.0 beta searches about 4-5 ply on average and gets beaten to scraps by v0.34. :) So far I've implemented quiescence search and transposition tables, and pretty much nothing else, not even rudimentary move ordering.

In doing so I've noticed that something is not well with how I handle transposition scores. For example I've been returning alpha/beta scores from the table rather than the actual alpha/beta in the node. This seems very wrong.
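
For reference, the usual way of using a stored entry looks something like this (a sketch, not the actual code; the point is that the stored score is only compared against the current node's alpha and beta, and in a fail-hard framework it's the node's own bound that gets returned):

static final int EXACT = 0, LOWER_BOUND = 1, UPPER_BOUND = 2;
static final int NO_CUTOFF = Integer.MIN_VALUE; // sentinel: keep searching

static int probe(int ttScore, int ttFlag, int ttDepth,
                 int depth, int alpha, int beta) {
    if (ttDepth < depth) return NO_CUTOFF;        // entry not searched deeply enough
    if (ttFlag == EXACT) return ttScore;
    if (ttFlag == LOWER_BOUND && ttScore >= beta)  return beta;   // fail-hard cutoff
    if (ttFlag == UPPER_BOUND && ttScore <= alpha) return alpha;
    return NO_CUTOFF;
}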

There's also something fishy going on with the hash move, but I'm not sure if this is true for v0.34 (there might be something fixing it along the way).

-

Anyway, I'm having fun, and the improvements come in 100 rating chunks again. My favorite kind of improvement. :)

Let's see if it ends up surpassing v0.34 eventually. I certainly hope so.

Edit:

First non-loss for the new engine! :)

   Engine            Score
1: Mediocre v0.34 19,5/20
2: Mediocre 1.0 beta 0,5/20

Oct 6, 2011

Sad day

Sometimes when you innovate, you make mistakes. It is best to admit them quickly, and get on with improving your other innovations. - Steve Jobs

Oct 5, 2011

[Info] Some info please and a TODO

I've implemented Jon Dart's suggested changes and I think they bring some improvement to playing strength. However to be honest, I'm not sure. My small tests so far point in all kinds of directions.

I need to set up a new testing environment and have taken a look at cutechess-cli which seems really nice. But I'm having problems with Mediocre crashing (or rather stalling) while running tournaments. Especially the new version. It also leaves really expensive Mediocre processes running, completely clogging up the computer.

I have no idea why.

This brings me to an important point that's been missing in Mediocre since the beginning. A decent logging framework. This should shed some light on those pesky stalls.

-

So a TODO for the next couple of days:

General

Set up a testing environment using cutechess-cli, some creative scripting (to be able to run gauntlet matches) and a number of well-chosen engines. Most likely using the YATS testing again (even though I can't get ProDeo to work anymore).

Get some decent logging going. This should help immensely in tracking down bugs and getting a feel for where improvements might be needed.

Engine improvement

Apart from Jon Dart's improvements:

Pondering - It's time to get it done, and shouldn't take too long.

Endgame knowledge and evaluation - Take a look at what I started with two years ago. I might be able to shake out some improvements.

Oct 3, 2011

Reboot

What happened to the repositories?

It has been a long long time since I took a look at the source code at https://sourceforge.net/projects/mediocrechess/. I have a firm memory of using CVS, but there's no trace of CVS ever being used.

Well, I'm planning for a new version of Mediocre. Thanks to Jon Dart (http://www.arasanchess.org/) for coming up with some suggested improvements that sparked my interest.

So in light of that I'll set up everything again. Mediocre v0.34 has been uploaded to the SVN on Sourceforge and I'll be using it "correctly" this time.

Trunk will be used for ongoing development, branches for specific ongoing changes, and tags for released versions.

Look forward to a new version in the near future.

Feb 15, 2011

Long time no see

In my last post I promised to update Mediocre within 6 months. It has now been almost 16 months. Well well.

On a personal note, I now work with Java development in an investment bank, and while it's a super job it leaves me little time for such time-consuming activities as chess programming.

Mediocre is, as mentioned, not abandoned, just resting, and while I won't make any promises this time, I will sooner or later get around to updating it again.

This post is mostly here to let you know I'm still around, and if you have questions I'm still reading your mails (mediocrechess@gmail.com, but I will most likely answer from zlaire@gmail.com).

Until later.