
I'm trying to understand how the MCTS Algorithm works and how I would implement it in a card game to improve the AI engine.

I've read the mcts.ai/ website and many papers about it, including one that reports results from applying Monte Carlo search with UCB to the AI for a Magic card game, which is more or less what I need to do. However, I'm having some trouble understanding certain points and how to apply the method to my problem. I'm also not very experienced in maths, so I get lost when the papers explain everything with complicated formulas.

This is what I've come up with so far:

1. Given a game state (the user's hand in the game), determine all the possible legal plays, then create a list of nodes (one representing each play) as children of the MCTS tree's root node, each storing its result (score value?)

2. Simulate a complete game (played until the end) for each of those legal plays using a random player, and record in every node whether the player won or lost, to get a full picture.
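Here is how I imagine those two steps in code, as a rough sketch. The `GameState` interface I use (`legal_plays()`, `apply(play)`, `is_terminal()`, `winner()`) is a placeholder I made up for my game's API, not an existing library:

```python
import random

class Node:
    """One node per legal play, storing playout statistics."""
    def __init__(self, state, play=None, parent=None):
        self.state = state      # game state after `play` was applied
        self.play = play
        self.parent = parent
        self.children = []
        self.wins = 0
        self.visits = 0

def expand(node):
    """Step 1: create one child per legal play from this state."""
    for play in node.state.legal_plays():
        node.children.append(Node(node.state.apply(play), play, node))

def random_playout(state, player):
    """Step 2: play random moves until the game ends; 1 = win, 0 = loss."""
    while not state.is_terminal():
        state = state.apply(random.choice(state.legal_plays()))
    return 1 if state.winner() == player else 0
```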

Here is where "I think" Monte Carlo + UCB should be applied:

1. Select the most promising play (node) using UCB recursively, and if it is a leaf, expand that node with all possible plays from its game state.

2. Simulate n playouts from the selected node until a certain amount of time is reached.

• At this stage I have some doubts... say I run a random playout from a list of possible playouts; what do I do with that first result to continue simulating? Should I grow the tree at that point?

3. How do I backpropagate the results?
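From what I've read, selection and backpropagation look something like the sketch below (the `Node` fields are names I've chosen myself, and `c=1.4` is just the commonly suggested exploration constant, roughly sqrt(2)):

```python
import math

class Node:
    """Minimal tree node; the field names are my own convention."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.wins = 0
        self.visits = 0

def ucb1(node, c=1.4):
    """Optimistic value estimate: win rate plus an exploration bonus."""
    if node.visits == 0:
        return float("inf")   # always try unvisited children first
    return (node.wins / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def select(node):
    """Walk down the tree, taking the child with the highest UCB1 value."""
    while node.children:
        node = max(node.children, key=ucb1)
    return node

def backpropagate(node, result):
    """Add the playout result (1 = win, 0 = loss) to every ancestor."""
    while node is not None:
        node.visits += 1
        node.wins += result
        node = node.parent
```

So the loop would repeat select → expand → simulate → backpropagate until the time budget runs out.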

Then,

• Bearing in mind that this is a complex card game with many possible moves, would performance still be good enough with so many children per node?

• If every simulation is based on a game state, and the state changes every time a player makes a move, how can I know whether the tree is still useful?

I would appreciate any help on this.

Thank you very much!


UCB performs a greedy best response towards its optimistic action-value estimates. Smooth UCB plays a smooth best response instead. However, a greedy best response is not well suited to self-play, where players keep trying to exploit each other rather than playing their average policy. Intuitively, combining the current UCB policy with the average policy might help with this policy-evaluation problem. The paper linked below instantiates general self-play MCTS with a Smooth UCB tree policy.

https://pdfs.semanticscholar.org/7b68/7599b4425aa959036071030e1212a3b359c7.pdf
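As a rough sketch of the idea under my reading of the paper: with some probability play the greedy UCB action, otherwise sample from the average policy (past visit frequencies). The `Node` fields, the fixed `eta`, and the UCB1 formula here are illustrative choices, not the paper's exact formulation:

```python
import math
import random

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.wins = 0
        self.visits = 0

def ucb1(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return (node.wins / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def smooth_ucb_action(node, eta=0.9, c=1.4):
    """With probability eta, play the greedy UCB best response;
    otherwise sample from the average policy (visit frequencies)."""
    if random.random() < eta:
        return max(node.children, key=lambda ch: ucb1(ch, c))
    weights = [ch.visits for ch in node.children]
    if sum(weights) == 0:
        return random.choice(node.children)  # no history yet
    return random.choices(node.children, weights=weights)[0]
```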