A summary of the large-scale outage in the AWS Tokyo Region

A system failure began in the AWS Tokyo Region around 13:00 on August 23, 2019, with effects such as EC2 instances becoming unreachable. This post gathers the related information.

AWS outage report

aws.amazon.com

Outage status

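Outage duration (EC2)	About 6 hours: August 23, 2019, around 12:36 to around 18:30 (majority recovered)
Outage duration (RDS)	About 9.5 hours: August 23, 2019, around 12:36 to around 22:05
Cause (EC2)	Some EC2 servers shut down due to overheating, which resulted from a control system failure that caused the cooling systems to fail
Impact scope	Some EC2, EBS, and RDS resources in a single AZ of the Tokyo Region (AP-NORTHEAST-1)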
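The approximate durations in the table above can be checked with a little date arithmetic. The following is a minimal sketch, assuming the start and end times listed above are exact; the constant and function names are illustrative only.

```python
from datetime import datetime

# Start and end times as listed in the table above (local time, approximate).
START = datetime(2019, 8, 23, 12, 36)
EC2_MOSTLY_RECOVERED = datetime(2019, 8, 23, 18, 30)
RDS_RECOVERED = datetime(2019, 8, 23, 22, 5)


def hours_between(start: datetime, end: datetime) -> float:
    """Return the elapsed time between two datetimes in hours."""
    return (end - start).total_seconds() / 3600


ec2_hours = hours_between(START, EC2_MOSTLY_RECOVERED)
rds_hours = hours_between(START, RDS_RECOVERED)

print(f"EC2: {ec2_hours:.1f} h")  # EC2: 5.9 h
print(f"RDS: {rds_hours:.1f} h")  # RDS: 9.5 h
```

The results (5 h 54 min and 9 h 29 min) are consistent with the rounded figures of about 6 hours and about 9.5 hours.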

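The affected region is Tokyo. The outage occurred in one of the four data center groups in the Tokyo area.
AWS is believed to have several hundred thousand customers in Japan. *1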

Services that reported outages

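The services below announced impact (including unofficial reports), as far as piyokango could confirm.
This list collects reports made around the same time; it is unclear whether all of them were caused by the AWS outage.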

Payment

Service reporting an outage	Cause / symptom
PayPay	Service intermittently unavailable
ファミペイ	AWS outage
BillingSystem (PayB)	Network failure at the cloud provider

Social networking

Service reporting an outage	Cause / symptom
mixi	Connection failure
ピグパーティ	Login failure
株探	AWS outage
teacup.	Cloud service outage
GameWith	AWS outage

Cryptocurrency exchanges

Service reporting an outage	Cause / symptom
バイナンス	AWS outage
フィスコ (Zaif)	AWS outage
GMOコイン	Connection failure

Internal corporate systems

Service reporting an outage	Cause / symptom
日本通運	Email system failure

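Services

Service reporting an outage	Cause / symptom
郵便局 (クリックポスト)	AWS outage
NTTドコモ (dTV, dマガジン)	Partially unavailable, among other effects
ドコモバイクシェア	AWS outage
ドコモ・ヘルスケア	Cloud service malfunction
ちよくる	System failure
楽天 (ラクマ), 楽天チケット	System malfunction
ふるなびトラベル	AWS outage
朝日新聞デジタル	System malfunction
日本経済新聞電子版	Updates stalled
週刊東洋経済プラス	AWS outage
SmartNews	AWS outage
Hulu	AWS outage
TVer	Service failure
ローソン	Network failure
スターバックスコーヒージャパン	System failure
イオンシネマ	Server failure
コカコーラ (コークオン)	Communication failure
ピザハット	AWS outage
駅すぱあと	AWS outage
あすけん	Cloud server failure
伊藤忠テクノソリューションズ (CIM-LINK)	Cause under investigation
まぐまぐ	AWS outage
Sansan	Network failure
スマホサイフ	System failure
SmartHR	AWS outage
freee	AWS outage
Backlog	AWS outage
Bizseek	AWS outage
CLIP STUDIO	Connection failure
XFLAG	Server system failure
LanCul	AWS outage
Progate	System failure at an external communication service
ネイティブキャンプ	AWS outage
GLOBIS	AWS outage
グッピーズ	Failure at the server operating company
eMark+	System failure
type転職エージェント	AWS outage
信州大学 eALPS	AWS outage
Serverworks	AWS outage
東レACS	AWS outage
JBCC	AWS outage
建設業振興基金 (建設キャリアアップシステム)	System failure
SavaMoni	Display delay caused by the monitoring infrastructure
バッファロー	Server failure
癒し処倉田屋	System failure
求人＠飲食店.COM	AWS outage
サンメディア (ARROW)	System failure
ヘルスケアシステムズ (カラダチェック)	System failure
店舗デザイン.COM	AWS outage
KENKEY	AWS outage
高崎モータースクール	Bus reservation system failure
ルネサンス	System failure
NEXWAY (NEXLINK)	AWS outage
Paravi	Content viewing failure
CYCLE & STUDIO R Shibuya	Reservation system failure
はんず	Reservation system failure
バニスタ	Reservation system failure
スパジアムジャポン	Ticket sales failure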

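E-commerce

Service reporting an outage	Cause / symptom
ユニクロ	AWS outage
東急ハンズ	Web server failure
小学館 (PALSHOP)	Server failure
PIXTA	AWS outage
STORES.jp	AWS outage
SHOPLIST	AWS outage
駿河屋通販サイト	AWS outage
PCワンズ	AWS outage
ドスパラ	Communication failure
Anker Japan	AWS outage
アリスブックス	AWS outage
iich	Data center network failure
FREAK'S STORE	System failure
Snow peak	AWS outage
IDEA online	System failure
マザーハウス	Server failure
エレクター	System failure
筆まめネット	System failure
ナノ・ユニバース	Shipping delays
ファイテン	System failure
ブラザーダイレクトクラブ	AWS outage
通販素材.COM	Connection failure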

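Official sites, apps, and fan clubs

Service reporting an outage	Cause / symptom
アクサ生命保険	AWS outage
京セラコミュニケーションシステム	System failure
日本相撲協会	Communication failure
東京ヤクルトスワローズ	Server failure
オリックス・バファローズ	System failure
福岡ソフトバンクホークス	Service failure
日本サッカー協会	AWS outage
名古屋グランパス	Server failure
北海道コンサドーレ札幌	AWS outage
Movie Walker	Server failure
バンダイチャンネル	Connection failure
マイナビ研修サービス	Server error
ワコム	Connection failure
コロンビアスポーツウェアジャパン	Server failure
WHILL	System failure
とらや	Network system failure
AISAN TECHNOLOGY	System failure
Crevo	AWS outage
LOWYA	AWS outage
GOODSMILE RACING	Connection failure
ポルノグラフィティ	Connection failure
Perfume	Communication failure
ディーンフジオカ	Data center failure
サザンオールスターズ	Data center failure
パスピエ	AWS outage
A.C.E JAPAN	AWS outage
スターダストチャンネル	Server failure
東映特撮ファンクラブ	Video service failure
NOISE MANIA	AWS outage
ペライチ	System failure
AKB48チーム8	Server failure
SKE48 Mobile	AWS outage
水瀬いのり	Connection failure
うたの☆プリンスさまっ♪ Shining Live 2nd anniversary special site	AWS outage
SiM App	AWS outage
アイドルマスターシャイニーカラーズ	Server failure
小林愛香 official fan club (ticket applications)	Server failure
魔法少女ザ・デュエル	Large-scale server failure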

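Games

Service reporting an outage	Cause / symptom
ドラゴンネストM	AWS outage
DMM GAMES	Connection failure
パズル＆ドラゴンズ	Equipment adjustment maintenance
誰ガ為のアルケミスト	Communication failure
メイプルストーリーM	Connection delays
アイドルマスター SideM LIVE ON ST@GE！	Communication failure
イケメンライブ	AWS outage
ブレイブフロンティア	Communication failure
KOF'98 UM OL	Large-scale server-side communication failure
グラフィティスマッシュ	AWS outage
きららファンタジア	AWS outage
SINoALICE	Communication failure
逆転オセロニア	Login failure
ららマジ	Login failure
バンドリ！ガールズバンドパーティ！	Connection failure
ガールズ＆パンツァー戦車道大作戦！	AWS outage
イケメン源	AWS outage
戦艦帝国	AWS outage
ドラゴンボールZ ドッカンバトル	Connection failure
かんぱに☆ガールズ	Server failure
幽☆遊☆白書 100%本気(マジ)バトル公式	Server failure
プリキュアつながるぱずるん	Server failure
ジョジョのピタパタポップ	Failure on some servers
ドラゴンボールZ ブッチギリマッチ	Server failure
HiGH&LOW	Communication failure
駅メモ！	Failure at a third-party cloud server
ゴシックは魔法乙女	Large-scale communication failure at the server provider
キングスレイド	AWS outage
ツキパラ	Server failure
グランドチェイス	AWS outage in Japan
アイドリッシュセブン	AWS outage
クラッシュフィーバー	Communication failure
スターライトステージ	Communication failure
仮面ライダーゲームインフォ	Large-scale communication failure at the server provider
FFブレイブエクスヴィアス	Data center failure
城とドラゴン	AWS outage
バトルフェスティバル	Server failure
HANDEAD ANTHEM	Large-scale server communication failure
踊り子クリノッペ	Connection failure
スタンドマイヒーローズ	AWS outage
スーパーロボット大戦DD	AWS outage
鋼鉄戦記C21	Web service failure
東京喰種	Communication failure
共闘ことばRPG コトダマン	Large-scale server failure in Japan
NEWSに恋して	Server failure
オルタナティブガールズ2	Login and connection failure
ミリオンダウト	Server failure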

Other

Service reporting an outage	Cause / symptom
ホビーセンターカトー	Operation suspended
AtCoder	AWS outage
クリスタルボルリユニオン	Server communication failure
SAO メモリーデフラグ	Emergency maintenance

Amazon status page

The latest status is published on the status page below.
status.aws.amazon.com

Amazon Elastic Compute Cloud (Tokyo)

2019/08/23 13:18	We are investigating connectivity issues affecting some instances in a single Availability Zone in the AP-NORTHEAST-1 Region.
2019/08/23 13:47	We can confirm that some instances are impaired and some EBS volumes are experiencing degraded performance within a single Availability Zone in the AP-NORTHEAST-1 Region. Some EC2 APIs are also experiencing increased error rates and latencies. We are working to resolve the issue.
2019/08/23 14:27	We have identified the root cause and are working toward recovery for the instance impairments and degraded EBS volume performance within a single Availability Zone in the AP-NORTHEAST-1 Region.
2019/08/23 15:40	We are starting to see recovery for instance impairments and degraded EBS volume performance within a single Availability Zone in the AP-NORTHEAST-1 Region. We continue to work towards recovery for all affected instances and EBS volumes.
2019/08/23 17:54	Recovery is in progress for instance impairments and degraded EBS volume performance within a single Availability Zone in the AP-NORTHEAST-1 Region. We continue to work towards recovery for all affected instances and EBS volumes.
2019/08/23 18:39	The majority of impaired EC2 instances and EBS volumes experiencing degraded performance have now recovered. We continue to work on recovery for the remaining EC2 instances and EBS volumes that are affected by this issue. This issue affects EC2 instances and EBS volumes in a single Availability Zone in the AP-NORTHEAST-1 Region.
2019/08/23 20:18	Beginning at 8:36 PM PDT a small percentage of EC2 servers in a single Availability Zone in the AP-NORTHEAST-1 Region shutdown due to overheating. This resulted in impaired EC2 instances and degraded EBS volume performance for resources in the affected area of the Availability Zone. The overheating was caused by a control system failure that caused multiple, redundant cooling systems to fail in parts of the affected Availability Zone. The chillers were restored at 11:21 PM PDT and temperatures in the affected areas began to return to normal. As temperatures returned to normal, power was restored to the affected instances. By 2:30 AM PDT, the vast majority of instances and volumes had recovered. We have been working to recover the remaining instances and volumes. A small number of remaining instances and volumes are hosted on hardware which was adversely affected by the loss of power. We continue to work to recover all affected instances and volumes. For immediate recovery, we recommend replacing any remaining affected instances or volumes if possible. Some of the affected instances may require action from customers and we will be reaching out to those customers with next steps.

Amazon Relational Database Service (Tokyo)

2019/08/23 13:22	We are investigating connectivity issues affecting some instances in a single Availability Zone in the AP-NORTHEAST-1 Region.
2019/08/23 14:25	We have identified the root cause of instance connectivity issues within a single Availability Zone in the AP-NORTHEAST-1 Region and are working toward recovery.
2019/08/23 15:01	We are starting to see recovery for instance connectivity issues within a single Availability Zone in the AP-NORTHEAST-1 Region. We continue to work towards recovery for all affected instances.
2019/08/23 17:16	We continue to see recovery for instance connectivity issues within a single Availability Zone in the AP-NORTHEAST-1 Region and are working towards recovery for all affected instances.
2019/08/23 20:46	The majority of instance connectivity issues have now recovered. We continue to work on recovery for the remaining instance connectivity issues within a single Availability Zone in the AP-NORTHEAST-1 Region.
2019/08/23 22:19	Between August 22 8:36 PM and August 23 6:05 AM PDT, some RDS instances experienced connectivity issues within a single Availability Zone in the AP-NORTHEAST-1 Region. The issue has been resolved and the service is operating normally.
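
The final EC2 and RDS summaries quote times in PDT, while the update timestamps in the left column appear to be Japan time. The sketch below converts the quoted PDT times to JST, assuming fixed offsets of UTC-7 for PDT and UTC+9 for JST. The EC2 summary gives no dates, so the dates used here are an assumption inferred from the RDS summary; the event labels and helper name are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: PDT is UTC-7 and JST is UTC+9 (Japan has no daylight saving time).
PDT = timezone(timedelta(hours=-7), "PDT")
JST = timezone(timedelta(hours=9), "JST")


def pdt_to_jst(year, month, day, hour, minute):
    """Interpret the given wall-clock time as PDT and convert it to JST."""
    return datetime(year, month, day, hour, minute, tzinfo=PDT).astimezone(JST)


# Times quoted in the EC2 and RDS summaries. Dates for the EC2 entries are
# assumed (the summary states only the time of day).
events = {
    "Servers begin shutting down": pdt_to_jst(2019, 8, 22, 20, 36),
    "Chillers restored": pdt_to_jst(2019, 8, 22, 23, 21),
    "Vast majority of EC2/EBS recovered": pdt_to_jst(2019, 8, 23, 2, 30),
    "RDS connectivity issues end": pdt_to_jst(2019, 8, 23, 6, 5),
}

for label, when in events.items():
    print(f"{when:%Y/%m/%d %H:%M} JST  {label}")
```

Under these assumptions the four times come out as 12:36, 15:21, 18:30, and 22:05 JST on August 23, which lines up with the outage windows given near the top of this post.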