Pathways: Google's AGI

Lê Nguyên Hoang

A month ago, Jeff Dean, lead of Google AI, announced at TED that Google is developing a "general purpose intelligence system". To give a bit of context, last January Google had already published a paper on a 1.6-trillion-parameter model based on the Switch Transformer architecture, which has roughly an order of magnitude more parameters than GPT-3's 175 billion. Yet, I've heard that Google usually only publishes a weaker version of the algorithms they actually develop and deploy at scale.

From a technical viewpoint, and given my understanding of switch transformers and of the challenges of scaling neural networks, I'm guessing that Pathways' model will be distributed across multiple machines (as it does not fit on a single machine), and that its computation (both in the forward pass and in backprop) will have to optimize its "pathway" by leveraging only some of the machines Pathways is deployed on. Google seems to want to leverage this to build models with 100-trillion+ parameters (note that model sizes have grown roughly 10x per year since BERT).
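To make the "pathway" idea concrete, here is a minimal sketch of top-1 expert routing in the style of switch transformers, where each token is processed by only one of several experts. This is an illustration under my own assumptions, not Pathways' actual mechanism (which has not been published): the names `route_top1`, `make_expert`, `router` and `experts` are hypothetical, the experts are toy linear maps, and each expert merely stands in for a shard that could live on its own machine.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_expert(rng, dim):
    """A toy expert: one dim x dim weight matrix (hypothetical stand-in for a shard)."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]


def route_top1(token, router, experts):
    """Run the token through the single expert with the highest router probability."""
    probs = softmax(matvec(router, token))
    chosen = max(range(len(experts)), key=lambda i: probs[i])
    # Only the chosen expert is evaluated; its output is scaled by the gate probability.
    output = [probs[chosen] * y for y in matvec(experts[chosen], token)]
    return chosen, output


rng = random.Random(0)
DIM, NUM_EXPERTS = 4, 8
router = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_expert(rng, DIM) for _ in range(NUM_EXPERTS)]
tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]

# Count how many tokens each expert receives.
usage = [0] * NUM_EXPERTS
outputs = []
for token in tokens:
    chosen, out = route_top1(token, router, experts)
    usage[chosen] += 1
    outputs.append(out)

total_params = NUM_EXPERTS * DIM * DIM  # parameters stored across all experts
active_params = DIM * DIM               # expert parameters actually used per token
print("tokens per expert:", usage)
print("expert params total vs. used per token:", total_params, active_params)
```

The point of the sketch is the last two numbers: the total parameter count grows with the number of experts, while the computation per token stays that of a single expert. That is why such models can be scaled to parameter counts that would not fit, or would be too slow, as a dense model.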

As a comparison, the human brain is estimated to have hundreds of trillions of synapses connecting its 100 billion neurons. In terms of numbers, Google's algorithms may soon match the human brain. However, unlike a human brain, these algorithms crunch billions of data points per second. Moreover, these data are extremely informative, as they track the daily habits, behaviors and beliefs of billions of humans on earth, through their activities on Google Search, YouTube, Google Mail, Google Drive, Google Maps or even Google Smart Keyboard. Perhaps more importantly still, Google's algorithms leverage this "intelligence" to have a massive-scale impact, by choosing which information will be shown to which human on earth, especially through the YouTube recommendation algorithm and Google AdSense.

I'm curious to know how this news updates your beliefs on the following questions:

[Three embedded prediction polls, each on a 10% to 99% scale; the question text was not recovered.]

EDIT: I see that many of you doubt that a recommendation algorithm can be an AGI. Does this mean that you reject the orthogonality thesis?

[Embedded prediction poll on a 10% to 99% scale; the question text was not recovered.]

(if you believe that any unaligned AGI is an existential threat, then I guess that you should answer yes...)

[Embedded prediction poll on a 10% to 99% scale; the question text was not recovered.]

Those who are interested in my views can check the Tournesol wiki (currently in a read-only staging mode, but it should come back to normal in a few weeks). Tournesol is a non-profit project that colleagues and I have launched to robustly solve AI alignment with a "short-term" agenda. We are searching for funding opportunities.

Charlie Steiner (4y)

Speaking for the YouTube Recommender pessimists, the problems I see are, first, the training data and, second, the human overseers.

Training just on video names and origins, along with watch statistics across all users, doesn't seem to reward clever planning in the physical world until after the AI is generally intelligent. This is the exact opposite of how you'd design an environment to encourage general real-world planning. (Curriculum learning is a thing for a reason.)

Second, the people training the thing aren't trying to make AGI. They're totally fine with merely extracting the surface-level regularities of the data rather than the structure of the physical world, so they won't keep jamming in more resources after it plateaus. If they're at all worried about planning, they'll make the reward function myopic (though perhaps they should already be doing this but aren't because $$$, which is concerning).

tailcalled (4y)

As I understand it, Google's proposed model is a mixture-of-experts (MoE) model, and I've heard MoE models achieve poorer understanding for equivalent parameter count than classical transformer models do.

gwern (4y)

It might be more useful to discuss Google's dense GPT-like LaMDA-137b instead, because there's so little information about Pathways or MUM. (We also know relatively little about the Wu Dao series of competing multimodal sparse models.) Google papers refuse to name it when they use LaMDA, for unclear reasons (it's not like they're fooling anyone), but they've been doing interesting OpenAI-like research with it: e.g. "Program Synthesis with Large Language Models", "Finetuned Language Models Are Zero-Shot Learners", or text style transfer.