• How should we view artificial intelligence cracking Texas Hold'em?

    Published: 2022-03-09 19:17



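    The four poker players: Daniel McAulay (far left), Jimmy Chou (second from left), Jason Les (second from right) and Dong Kim (far right). Libratus's project lead (third from left) and engineer (third from right).
    In this event, four of the world's top professional poker players, Jason Les, Dong Kim, Daniel McAulay and Jimmy Chou, played against the AI program Libratus. The match ran for 20 days and 120,000 hands in total. In the end the AI beat the four human players by a margin of 1,766,250 points.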

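    The format was one-on-one (heads-up) Texas Hold'em. Over the 20 days the four human players played 120,000 hands in total, each of them playing 30,000 heads-up hands against the AI. That averages out to 1,500 hands per player per day over about 10 hours of play, or 150 heads-up hands per hour.
    The format resembles the following online poker application: Play Texas Holdem Against Strong Poker Ai Bots. The poker AI there is called HibiscusB. It can beat intermediate-level players, but it is not as strong as Libratus.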


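    The four players lost a combined 1,766,250 points to Libratus, which is 17,662.5 big blinds (one big blind is 100 points). Dong Kim had the best result but still lost 85,649 points, about 856 big blinds. Jason Les had the worst result, losing about 8,800 big blinds.
    To put this in perspective: most underground Texas Hold'em games in China today are played with 5/10 yuan blinds and a buy-in of 1,000-2,000 yuan (which is absolutely illegal). Playing heads-up against the AI at those stakes, you would lose about 220 yuan per hour, about 2,200 yuan per day on average, and about 44,000 yuan over 20 days.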
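    The figures above can be reproduced with a few lines of arithmetic. The sketch below assumes that the 220 / 2,200 / 44,000 yuan figures refer to the average loss of one of the four players (the article does not say so explicitly, but the numbers match); the constant and function names are made up for this illustration.

```python
# Figures taken from the article.
TOTAL_LOSS_POINTS = 1_766_250   # combined loss of the four players
BIG_BLIND_POINTS = 100          # one big blind in match points
PLAYERS = 4
DAYS = 20
HOURS_PER_DAY = 10
BIG_BLIND_YUAN = 10             # 5/10 yuan blinds


def loss_in_yuan():
    """Average loss of one player, converted to a 5/10 yuan game."""
    loss_bb_per_player = TOTAL_LOSS_POINTS / BIG_BLIND_POINTS / PLAYERS
    total = loss_bb_per_player * BIG_BLIND_YUAN
    per_day = total / DAYS
    per_hour = per_day / HOURS_PER_DAY
    return total, per_day, per_hour


total, per_day, per_hour = loss_in_yuan()
print(f"{total:.0f} yuan in total, {per_day:.0f} per day, {per_hour:.0f} per hour")
```

    The exact values come out slightly above the rounded figures quoted in the text.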
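    Some people may ask: why not take this AI program, connect it to foreign poker sites such as PokerStars or Full Tilt, and happily win US dollars? Heads-up no-limit Texas Hold'em has about 10^160 possible situations, and the program needs a supercomputer to run. Such a machine probably costs well over several million US dollars, and the electricity it consumes each hour is likely worth more than the money it would win.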

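    In other words, if each match consists of 120,000 hands and humans played 1,000 such matches against the AI, the AI would win 998 of them and the humans only 2. Libratus therefore has an insurmountable advantage.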

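    After 30,000 hands each against the same AI, the four players' results fell into clearly different tiers. Dong Kim lost 856 big blinds, roughly ten times better than Jason Les, who lost 8,800. The other two players lost 2,776 and 5,228 big blinds respectively.
    If these four played 30,000 hands against each other, Dong Kim playing Jason Les would be expected to win roughly 8,800 - 856 = 7,944 big blinds, perhaps give or take 1,000. In short, Dong Kim does have a skill edge over Jason Les, but it takes tens of thousands of hands for that edge to show.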

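    These four are always described as world-class poker players, yet most poker fans have never heard of them. Why were poker stars such as Phil Ivey, Daniel Negreanu or Tom Dwan not there?
    In fact, the poker stars people watch in videos every day are playing hands from 5-6 years ago. Once online poker took off, a large number of excellent players emerged. Anything that moves onto the internet starts developing at an astonishing pace. A skill level that could win 1 million US dollars online five years ago can only lose money five years later, so the experts of the past are not the experts of today. If Daniel Negreanu sat down at a 1/2 dollar game on PokerStars today, he would not necessarily win money.
    Also, this human-versus-AI match meant playing 8-10 hours a day for 20 days, for prize money of less than 200,000 US dollars. Tom Dwan lost 11 million US dollars in a single hand in a Macau casino. Players like that would not bother with such a long match for so little prize money.
    Given that the AI beat these four experts by such a huge margin, it is safe to say that nobody in the world can beat Libratus. Libratus is built on Nash's game theory and learns through Counterfactual Regret Minimization (CFR), arriving at a way of playing that approaches perfect poker.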

    AI applied to poker: Counterfactual Regret Minimization
    A poker AI arrives at a near-perfect way of playing by training on 100 trillion hands with Counterfactual Regret Minimization.
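    To make the idea of CFR concrete, here is a minimal sketch on Kuhn poker, a three-card toy game that is commonly used to teach the algorithm. It is not the training code of Libratus, which the article does not show; the class and function names are invented for this example, and every deal is enumerated instead of being sampled. Each decision point keeps a running sum of regrets ("how much better would I have done with the other action?") and plays in proportion to the positive regrets; the average strategy over many self-play iterations approaches a Nash equilibrium.

```python
from itertools import permutations

ACTIONS = "pb"  # p = pass (check or fold), b = bet (or call)


class Node:
    """Regret and strategy accumulators for one information set."""

    def __init__(self):
        self.regret_sum = [0.0, 0.0]
        self.strategy_sum = [0.0, 0.0]

    def strategy(self):
        # Regret matching: play in proportion to positive regret.
        pos = [max(r, 0.0) for r in self.regret_sum]
        total = sum(pos)
        return [p / total for p in pos] if total > 0 else [0.5, 0.5]

    def average_strategy(self):
        total = sum(self.strategy_sum)
        return [s / total for s in self.strategy_sum] if total > 0 else [0.5, 0.5]


def terminal_payoff(cards, history):
    """Payoff for the player who would act next, or None if play continues."""
    if len(history) < 2:
        return None
    player = len(history) % 2
    higher = cards[player] > cards[1 - player]
    last_two = history[-2:]
    if last_two == "pp":              # check, check: showdown for the antes
        return 1 if higher else -1
    if last_two == "bp":              # bet, fold: the bettor wins the ante
        return 1
    if last_two == "bb":              # bet, call: showdown for ante plus bet
        return 2 if higher else -2
    return None                       # "pb": the first player must still respond


def cfr(nodes, cards, history, reach0, reach1):
    """Return the expected value for the player to act at this history."""
    payoff = terminal_payoff(cards, history)
    if payoff is not None:
        return payoff
    player = len(history) % 2
    node = nodes.setdefault(str(cards[player]) + history, Node())
    strat = node.strategy()
    utils = [0.0, 0.0]
    node_util = 0.0
    for i, action in enumerate(ACTIONS):
        if player == 0:
            child = cfr(nodes, cards, history + action, reach0 * strat[i], reach1)
        else:
            child = cfr(nodes, cards, history + action, reach0, reach1 * strat[i])
        utils[i] = -child             # zero-sum: the child value belongs to the opponent
        node_util += strat[i] * utils[i]
    my_reach, opp_reach = (reach0, reach1) if player == 0 else (reach1, reach0)
    for i in range(2):
        # Counterfactual regret is weighted by the opponent's reach probability.
        node.regret_sum[i] += opp_reach * (utils[i] - node_util)
        node.strategy_sum[i] += my_reach * strat[i]
    return node_util


def train(iterations):
    nodes = {}
    deals = list(permutations([1, 2, 3], 2))   # (first player's card, second player's card)
    total = 0.0
    for _ in range(iterations):
        for cards in deals:
            total += cfr(nodes, cards, "", 1.0, 1.0)
    return nodes, total / (iterations * len(deals))


nodes, value = train(10_000)
print(f"average value for the first player: {value:.4f}")  # known game value is -1/18
for key in sorted(nodes):
    print(key, [round(p, 2) for p in nodes[key].average_strategy()])
```

    Kuhn poker has only 12 information sets, so this runs in about a second. Heads-up no-limit Hold'em is larger by well over a hundred orders of magnitude, which is why real systems need abstraction, sampling and supercomputers.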
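    For example: when a wild, loose player shoves all-in for 100 big blinds, hands such as AJ, AQ, TT, 99 or better are good enough to call. When a very tight player shoves all-in for 100 big blinds, you need at least AK or QQ to call.
    So the AI also has to adjust its own range based on recent, related hands in order to adapt to different opponents and different styles. This calls for another technique, recursive reasoning, used to perform Continuous Re-Solving...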
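      "Halfway through the match we really thought we were going to win," said Daniel McAulay, one of the professional players. "We really had a chance to beat it."
      "We tried everything we could think of, and it was simply too strong," said Jason Les, another of the players. "It showed up every day and wore down our morale, and in the end we lost this badly. I thought our final chip counts would be very close."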
    Too tired! o (╯□╰)o
    I will translate the rest of the article later. If this post makes it onto Zhihu Daily, I may consider it ( ´◔ ‸◔`)

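    If artificial intelligence feels too abstract and hard to understand, you can read an introduction to an AI application that I wrote. It is simple enough for a middle-school student to follow: Introduction to CMAC Neural Network with Examples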


    However, how the opponent’s actions reveal that information depends upon their knowledge of our private information and how our actions reveal it. This kind of recursive reasoning is why one cannot easily reason about game situations in isolation, which is at the heart of local search methods for perfect information games. Competitive AI approaches in imperfect information games typically reason about the entire game and produce a complete strategy prior to play (14, 15). Counterfactual regret minimization (CFR) (11, 14, 17) is one such technique that uses self-play to do recursive reasoning through adapting its strategy against itself over successive iterations. If the game is too large to be solved directly, the common solution is to solve a smaller, abstracted game. To play the original game, one translates situations and actions from the original game into the abstract game.
    While this approach makes it feasible for programs to reason in a game like HUNL, it does so by squeezing HUNL’s 10^160 situations into the order of 10^14 abstract situations.
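    As a rough illustration of what "translating actions into the abstract game" means, the sketch below maps a real bet onto the nearest of a few allowed bet sizes. The bet sizes and the function name are hypothetical, and nearest-size rounding is a deliberate simplification; it is not claimed to be the mapping used by any particular program.

```python
# Hypothetical abstract bet sizes, expressed as fractions of the pot.
ABSTRACT_BET_FRACTIONS = [0.5, 1.0, 2.0]


def translate_bet(bet, pot, abstract_fractions=ABSTRACT_BET_FRACTIONS):
    """Map a real bet onto the closest abstract bet size (as a pot fraction)."""
    if pot <= 0:
        raise ValueError("pot must be positive")
    fraction = bet / pot
    return min(abstract_fractions, key=lambda f: abs(f - fraction))


print(translate_bet(bet=130, pot=200))   # a 0.65-pot bet is treated as a half-pot bet
print(translate_bet(bet=900, pot=400))   # a 2.25-pot bet is treated as a 2x-pot bet
```

    Every real bet between two abstract sizes collapses onto one of them, which is how an enormous number of real situations is squeezed into far fewer abstract ones, at the cost of losing detail.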

    DeepStack takes a fundamentally different approach. It continues to use the recursive reasoning of CFR to handle information asymmetry. However, it does not compute and store a complete strategy prior to play and so has no need for explicit abstraction. Instead it considers each particular situation as it arises during play, but not in isolation. It avoids reasoning about the entire remainder of the game by substituting the computation beyond a certain depth with a fast approximate estimate. This estimate can be thought of as DeepStack’s intuition: a gut feeling of the value of holding any possible private cards in any possible poker situation. Finally, DeepStack’s intuition, much like human intuition, needs to be trained. We train it with deep learning using examples generated from random poker situations. We show that DeepStack is theoretically sound, produces substantially less exploitable strategies than abstraction-based techniques, and is the first program to beat professional poker players at HUNL with a remarkable average win rate of over 450 mbb/g.
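    The unit mbb/g means milli-big-blinds per game, that is, thousandths of a big blind won per hand. The short sketch below converts between units and applies the same unit to the Libratus totals quoted earlier in this post. The function names are made up for this example. The two results come from different matches against different opponents, so this is only a unit conversion, not a comparison of playing strength.

```python
def mbb_per_game(big_blinds_won, hands):
    """Win rate in milli-big-blinds per game (hand)."""
    return 1000.0 * big_blinds_won / hands


def big_blinds_per_hand(mbb_g):
    """Convert a win rate in mbb/g to big blinds per hand."""
    return mbb_g / 1000.0


# Libratus match totals from this post: 17,662.5 big blinds over 120,000 hands.
libratus_rate = mbb_per_game(17662.5, 120_000)
print(f"Libratus match: {libratus_rate:.1f} mbb/g")

# The win rate quoted in the DeepStack excerpt above.
print(f"450 mbb/g = {big_blinds_per_hand(450):.2f} big blinds per hand")
```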
    Continuous Re-Solving
    Suppose we have a solution for the entire game, but then in some public state we forget this strategy. Can we reconstruct a solution for the subtree without having to solve the entire game again? We can, through the process of re-solving (17). We need to know both our range at the public state and a vector of expected values achieved by the opponent under the previous solution for each opponent hand. With these values, we can reconstruct a strategy for only the remainder of the game, which does not increase our overall exploitability. Each value in the opponent’s vector is a counterfactual value, a conditional “what-if” value that gives the expected value if the opponent reaches the public state with a particular hand. The CFR algorithm also uses counterfactual values, and if we use CFR as our solver, it is easy to compute the vector of opponent counterfactual values at any public state.
    Re-solving, though, begins with a solution strategy, whereas our goal is to avoid ever maintaining a strategy for the entire game. We get around this by doing continuous re-solving: reconstructing a strategy by re-solving every time we need to act; never using the strategy beyond our next action. To be able to re-solve at any public state, we need only keep track of our own range and a suitable vector of opponent counterfactual values. These values must be an upper bound on the value the opponent can achieve with each hand in the current public state, while being no larger than the value the opponent could achieve had they deviated from reaching the public state.
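    The "own range" mentioned above is a probability distribution over the private hands we could be holding. A natural way to keep it up to date after we act is Bayes' rule: multiply each hand's probability by the chance that our strategy takes the chosen action with that hand, then renormalize. The excerpt breaks off before describing the update, so treat this as an assumption; the function name, the two-hand range and the action probabilities below are invented for illustration.

```python
def update_range(own_range, strategy, action):
    """Bayes update of a hand range after taking `action`.

    own_range: dict mapping hand -> probability of holding it
    strategy:  dict mapping hand -> dict of action -> probability
    """
    unnormalized = {
        hand: prob * strategy[hand].get(action, 0.0)
        for hand, prob in own_range.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the action has zero probability under this strategy")
    return {hand: value / total for hand, value in unnormalized.items()}


# Toy example with two possible hands and made-up action probabilities.
start_range = {"AA": 0.5, "72": 0.5}
toy_strategy = {
    "AA": {"raise": 0.9, "call": 0.1},
    "72": {"raise": 0.3, "call": 0.7},
}
after_raise = update_range(start_range, toy_strategy, "raise")
print(after_raise)
```

    In this toy case a raise shifts the range toward the strong hand, because the made-up strategy raises with it three times as often.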
    At the start of the game, our range is uniform and the opponent counterfactual values are initialized to the value of holding each private hand at the start. When it is our turn to act

    Exploitability
    The main goal of DeepStack is to approximate Nash equilibrium play, i.e., minimize exploitability. While the exact exploitability of a HUNL poker strategy is intractable to compute, the recent local best-response technique (LBR) can provide a lower bound on a strategy’s exploitability (20) given full access to its action probabilities. LBR uses the action probabilities to compute the strategy’s range at any public state. Using this range it chooses its response action from a fixed set using the assumption that no more bets will be placed for the remainder of the game.
