in Azure by (5.8k points)

I am trying to deploy a prediction web service to Azure with the ML Workbench process in cluster mode, following this tutorial (https://docs.microsoft.com/en-us/azure/machine-learning/preview/tutorial-classifying-iris-part-3#prepare-to-operationalize-locally).

The model, the scoring script, and the schema all get added to the manifest, but creating the service then fails:

Creating service..........................................................Error occurred: {'Error': {'Code': 'KubernetesDeploymentFailed', 'Details': [{'Message': 'Back-off 40s restarting failed container=...pod=...', 'Code': 'CrashLoopBackOff'}], 'StatusCode': 400, 'Message': 'Kubernetes Deployment failed'}, 'OperationType': 'Service', 'State':'Failed', 'Id': '...', 'ResourceLocation': '/api/subscriptions/...', 'CreatedTime': '2017-10-26T20:30:49.77362Z','EndTime': '2017-10-26T20:36:40.186369Z'}

Here is the output of the realtime service logs:

C:\Users\userguy\Documents\azure_ml_workbench\projecto>az ml service logs realtime -i projecto

2017-10-26 20:47:16,118 CRIT Supervisor running as root (no user in config file)

2017-10-26 20:47:16,120 INFO supervisord started with pid 1

2017-10-26 20:47:17,123 INFO spawned: 'rsyslog' with pid 9

2017-10-26 20:47:17,124 INFO spawned: 'program_exit' with pid 10

2017-10-26 20:47:17,124 INFO spawned: 'nginx' with pid 11

2017-10-26 20:47:17,125 INFO spawned: 'gunicorn' with pid 12

2017-10-26 20:47:18,160 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

2017-10-26 20:47:18,160 INFO success: program_exit entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

2017-10-26 20:47:22,164 INFO success: nginx entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)

2017-10-26T20:47:22.519159Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0

2017-10-26T20:47:22.520097Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (12)

2017-10-26T20:47:22.520375Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync

2017-10-26T20:47:22.521757Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300

2017-10-26T20:47:22.522646Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 22

2017-10-26 20:47:27,669 WARN received SIGTERM indicating exit request

2017-10-26 20:47:27,669 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die

2017-10-26T20:47:27.669556Z, INFO, 00000000-0000-0000-0000-000000000000, , Handling signal: term

2017-10-26 20:47:30,673 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die

2017-10-26 20:47:33,675 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die

Initializing logger

2017-10-26T20:47:36.564469Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client

2017-10-26T20:47:36.564991Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator

2017-10-26T20:47:36.565316Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks

2017-10-26T20:47:36.565642Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function

2017-10-26 20:47:36.715933: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.

2017-10-26 20:47:36,716 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die

2017-10-26 20:47:36.716376: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.

2017-10-26 20:47:36.716542: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.

2017-10-26 20:47:36.716703: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.

2017-10-26 20:47:36.716860: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

this is the init

2017-10-26T20:47:37.551940Z, INFO, 00000000-0000-0000-0000-000000000000, , Users's init has completed successfully

Using TensorFlow backend.

2017-10-26T20:47:37.553751Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 22)

2017-10-26T20:47:37.885303Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master

2017-10-26 20:47:37,885 WARN killing 'gunicorn' (12) with SIGKILL

2017-10-26 20:47:37,886 INFO stopped: gunicorn (terminated by SIGKILL)

2017-10-26 20:47:37,889 INFO stopped: nginx (exit status 0)

2017-10-26 20:47:37,890 INFO stopped: program_exit (terminated by SIGTERM)

2017-10-26 20:47:37,891 INFO stopped: rsyslog (exit status 0)

Received 41 lines of log

My best guess is that something is failing silently and triggering "WARN received SIGTERM indicating exit request". The rest of the scoring.py script seems to run: TensorFlow gets initialized and the "this is the init" print statement appears.
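One way to test that guess is to line up the key timestamps from the log above. The Python sketch below is not part of the tutorial; it simply copies four timestamps out of the log, assumes that the supervisord lines (no timezone) and the gunicorn lines (trailing "Z") are both UTC inside the container, and computes how many seconds after the gunicorn worker booted each event happened.

```python
from datetime import datetime

# Timestamps copied from the realtime log above.
# supervisord lines use "YYYY-MM-DD HH:MM:SS,mmm" (no timezone);
# gunicorn/app lines use ISO 8601 with a trailing "Z".
# ASSUMPTION: both clocks are UTC inside the container.
EVENTS = [
    ("worker booted", "2017-10-26T20:47:22.522646Z"),
    ("SIGTERM received", "2017-10-26 20:47:27,669"),
    ("user init invoked", "2017-10-26T20:47:36.565642Z"),
    ("user init completed", "2017-10-26T20:47:37.551940Z"),
]


def parse_ts(raw: str) -> datetime:
    """Parse either log timestamp format into a naive datetime."""
    if raw.endswith("Z"):
        return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
    # %f accepts 1-6 digits when parsing, so ",669" becomes 669000 microseconds.
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S,%f")


def gaps_from_boot(events):
    """Return (label, seconds since the first event) for each event."""
    start = parse_ts(events[0][1])
    return [(label, (parse_ts(ts) - start).total_seconds()) for label, ts in events]


if __name__ == "__main__":
    for label, secs in gaps_from_boot(EVENTS):
        print(f"{label:>20}: +{secs:.3f} s")
```

With these values, SIGTERM arrives about 5.1 s after the worker boots, while the user's init() is only invoked about 14 s after boot. That ordering hints that the container may be stopped from outside (for example by a Kubernetes health check or deployment timeout) while the scoring script is still loading, rather than by an exception inside init(); this is an inference from the timestamps, not something the log states directly.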

http://127.0.0.1:63437 is accessible from my local machine, but the UI endpoint is blank.

Any ideas on how to get this up and running in an Azure cluster? I'm not very familiar with how Kubernetes works, so any basic debugging guidance would be appreciated.

1 Answer

by (9.6k points)

Azure ML Workbench itself is outdated and has been superseded by Azure Machine Learning (AML) studio, but this could have happened due to a bug. Hopefully the bug has since been fixed.
