---
orphan: true
---

(aws-neuron-core-inference-tutorial)=

# Serve an Inference Model on AWS NeuronCores Using FastAPI (Experimental)

This example compiles a BERT-based model and deploys the traced model on an AWS Inferentia (Inf2) or Trainium (Trn1) instance using Ray Serve and FastAPI.

:::{note}
Before starting this example:

* Set up [PyTorch Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx)
* Install the AWS NeuronCore drivers and tools, and `torch-neuronx`, based on the instance type
:::

```bash
python -m pip install "ray[serve]" requests transformers
```

This example uses the [j-hartmann/emotion-english-distilroberta-base](https://huggingface.co/j-hartmann/emotion-english-distilroberta-base) model and [FastAPI](https://fastapi.tiangolo.com/).

Use the following code to compile the model:

```{literalinclude} ../doc_code/aws_neuron_core_inference_serve.py
:language: python
:start-after: __compile_neuron_code_start__
:end-before: __compile_neuron_code_end__
```

When you compile the model, you should see log messages similar to the following:

```text
Downloading (…)lve/main/config.json: 100%|██████████| 1.00k/1.00k [00:00<00:00, 242kB/s]
Downloading pytorch_model.bin: 100%|██████████| 329M/329M [00:01<00:00, 217MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 294/294 [00:00<00:00, 305kB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 798k/798k [00:00<00:00, 22.0MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 57.0MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 6.16MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 239/239 [00:00<00:00, 448kB/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Saved Neuron-compiled model ./sentiment_neuron.pt
```

The traced model is now ready for deployment.
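Before deploying, you can optionally sanity-check the traced model on the Neuron instance. The following is a minimal sketch, not part of the original tutorial: it assumes the model was saved to `./sentiment_neuron.pt`, that tracing used fixed-shape inputs of length 128, and that the traced module returns the classifier logits as the first element of its output.

```python
# Sanity-check sketch: runs only on a Neuron (Inf2/Trn1) instance.
import torch
import torch_neuronx  # noqa: F401 -- must be imported before loading a Neuron TorchScript model
from transformers import AutoConfig, AutoTokenizer

MODEL_ID = "j-hartmann/emotion-english-distilroberta-base"

# Load the Neuron-compiled TorchScript module produced in the previous step.
model = torch.jit.load("sentiment_neuron.pt")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
config = AutoConfig.from_pretrained(MODEL_ID)

# Tokenize with fixed shapes; a Neuron-traced model expects the same input
# shapes that were used at trace time (128 is an assumed sequence length).
inputs = tokenizer(
    "Ray is super cool",
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=128,
)

# Traced models take positional tensor arguments and typically return a
# tuple with the logits first (an assumption about the traced signature).
logits = model(inputs["input_ids"], inputs["attention_mask"])[0]
print(config.id2label[int(logits.argmax())])  # e.g. "joy"
```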
Save the following code to a file named `aws_neuron_core_inference_serve.py`. Use `serve run aws_neuron_core_inference_serve:entrypoint` to start the Serve application.

```{literalinclude} ../doc_code/aws_neuron_core_inference_serve.py
:language: python
:start-after: __neuron_serve_code_start__
:end-before: __neuron_serve_code_end__
```

You should see the following log messages when the deployment succeeds:

```text
(ServeController pid=43105) INFO 2023-08-23 20:29:32,694 controller 43105 deployment_state.py:1372 - Deploying new version of deployment default_BertBaseModel.
(ServeController pid=43105) INFO 2023-08-23 20:29:32,695 controller 43105 deployment_state.py:1372 - Deploying new version of deployment default_APIIngress.
(ProxyActor pid=43147) INFO 2023-08-23 20:29:32,620 http_proxy 10.0.1.234 http_proxy.py:1328 - Proxy actor 8be14f6b6b10c0190cd0c39101000000 starting on node 46a7f740898fef723c3360ef598c1309701b07d11fb9dc45e236620a.
(ProxyActor pid=43147) INFO: Started server process [43147]
(ServeController pid=43105) INFO 2023-08-23 20:29:32,799 controller 43105 deployment_state.py:1654 - Adding 1 replica to deployment default_BertBaseModel.
(ServeController pid=43105) INFO 2023-08-23 20:29:32,801 controller 43105 deployment_state.py:1654 - Adding 1 replica to deployment default_APIIngress.
2023-08-23 20:29:44,690 SUCC scripts.py:462 -- Deployed Serve app successfully.
```

Use the following code to send requests:

```python
import requests

response = requests.get(
    "http://127.0.0.1:8000/infer", params={"sentence": "Ray is super cool"}
)
print(response.status_code, response.json())
```

The response includes a status code and the classifier output:

```text
200 joy
```
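For reference, the following is a condensed, illustrative sketch of the kind of Serve application that `aws_neuron_core_inference_serve.py` contains. The deployment names `BertBaseModel` and `APIIngress` and the `/infer` route come from the logs and request above, but the Neuron resource name, tokenizer settings, and output handling are assumptions; the file included above is the authoritative version.

```python
# Illustrative sketch only; see ../doc_code/aws_neuron_core_inference_serve.py
# for the actual tutorial code.
import torch
from fastapi import FastAPI
from ray import serve

app = FastAPI()

MODEL_ID = "j-hartmann/emotion-english-distilroberta-base"


@serve.deployment(num_replicas=1)
@serve.ingress(app)
class APIIngress:
    def __init__(self, bert_model_handle) -> None:
        self.handle = bert_model_handle

    @app.get("/infer")
    async def infer(self, sentence: str):
        # Forward the request to the model deployment.
        return await self.handle.infer.remote(sentence)


@serve.deployment(
    # "neuron_cores" is an assumed resource name for scheduling onto a NeuronCore.
    ray_actor_options={"resources": {"neuron_cores": 1}},
)
class BertBaseModel:
    def __init__(self):
        import torch_neuronx  # noqa: F401 -- required before loading Neuron TorchScript
        from transformers import AutoConfig, AutoTokenizer

        self.model = torch.jit.load("sentiment_neuron.pt")
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        self.config = AutoConfig.from_pretrained(MODEL_ID)

    def infer(self, sentence: str):
        # Fixed-shape inputs matching the assumed trace-time sequence length.
        inputs = self.tokenizer(
            sentence,
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=128,
        )
        logits = self.model(inputs["input_ids"], inputs["attention_mask"])[0]
        return self.config.id2label[int(logits.argmax())]


entrypoint = APIIngress.bind(BertBaseModel.bind())
```

Keeping the model in its own deployment lets Serve scale model replicas onto individual NeuronCores independently of the HTTP ingress.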