
Issues with using model bundle for prediction #1876

Closed

xholmes opened this issue Aug 30, 2023 · 15 comments

@xholmes commented Aug 30, 2023

Hi,

Thank you for the awesome work on developing raster-vision.
I managed to train a model for semantic segmentation on satellite images using code adapted from the ISPRS Potsdam example.

However, I have run into issues using the model bundle for prediction. I tried several approaches, which met with either failure or varying degrees of success. I have listed each approach and its level of success below. I would very much appreciate your views on how I can use the model bundle effectively for prediction on unseen images/data.

Regardless of approach, my goal is to predict on a list of unseen images and to generate a prediction TIFF and a vector file for each class.

Approach 1: Using the Predictor class

For the 1st approach, I tried to use the Predictor class. This method had worked for me with RV v0.12. However, with v0.20, I got the following error:

Traceback (most recent call last):
  File ".../v020_scripts/predict_pipeline.py", line 31, in <module>
    predictor = Predictor(args.m, args.t)
  File ".../venv/lib/python3.9/site-packages/rastervision/core/predictor.py", line 58, in __init__
    self.scene: 'SceneConfig' = self.config.dataset.validation_scenes[0]
AttributeError: 'LearnerPipelineConfig' object has no attribute 'dataset'

Note: I redacted some of the path information, so please ignore the odd-looking file paths.

The bundle I used was the model-bundle.zip file in the bundle folder, generated by the training script. I have also tried using the model bundle file found in the train folder, but that, too, produced the same error.

For reference, the key lines of the code used for this approach are as follows:

    # args (parsed CLI arguments) and img_list are defined elsewhere in the script
    import os
    import time

    from rastervision.core.predictor import Predictor

    # Specify the model bundle file and tmp directory
    t_start = time.perf_counter()
    predictor = Predictor(args.m, args.t)
    print("Model loaded in {}".format(time.perf_counter() - t_start))

    label_format = '{}_label.tif'
    vector_format = '{}_{}_vector.json'
    predict_tif = os.path.join(args.o, 'labels.tif')
    predict_vector_path = os.path.join(args.o, 'predict')

    # Run predict on one image at a time
    o_start = time.perf_counter()
    for img in img_list:
        t_start = time.perf_counter()
        predictor.predict([img], args.o)

Approach 2: Using the predict command-line command

The 2nd approach I tried was to use the predict command-line command. For this approach, I used the model-bundle.zip file found in the bundle folder generated when the model training finished. However, I did not get far with this approach, as I kept getting the following error:

Error: Missing argument 'MODEL_BUNDLE'

This is despite having provided the model bundle in the following ways:

  1. specifying the exact path to the model-bundle.zip file
  2. specifying the folder rather than the file
  3. unzipping the model-bundle.zip file and specifying the folder
  4. unzipping the model-bundle.zip file and specifying the inner model-bundle.zip file (there was another model-bundle.zip inside the zip archive)

Approach 3: Using the Learner class

For the 3rd approach, I used the Learner class as suggested here. The following is the script I put together:

    # Imports needed by this script (typical RV 0.20 import locations)
    import math
    import os

    from rastervision.core.data import (ClassConfig, RasterioSource,
                                        SemanticSegmentationLabels,
                                        StatsTransformer)
    from rastervision.core.data.label_store import PolygonVectorOutputConfig
    from rastervision.pytorch_learner import (
        SemanticSegmentationLearner,
        SemanticSegmentationSlidingWindowGeoDataset)

    chip_sz = 64
    stride = math.floor(chip_sz / 4)

    label_classes = ['background', 'imprevious', 'building', 'tree', 'bare_earth']
    # black, yellow, blue, green, orange
    label_colours = ['#000000', '#f2e70c', '#0000ff', '#36ff00', '#e87400']

    class_config = ClassConfig(names=label_classes, colors=label_colours,
                               null_class='background')

    # Load data to use for inference
    img_set = [f for f in os.listdir(args.i) if f.endswith('.tif')]

    # Load pre-trained model for inference
    learner = SemanticSegmentationLearner.from_model_bundle(
        args.b, training=False
    )
    vector_folder = os.path.join(args.o, 'vector')
    vector_uri = os.path.join(args.o, 'vector', '{}_class{}.json')
    os.makedirs(vector_folder, exist_ok=True)

    for f in img_set:
        img = os.path.join(args.i, f)
        fname = f.split('.')[0]
        output_uri = os.path.join(args.o, fname)
        os.makedirs(output_uri)

        raster_source = RasterioSource(img)
        stats_transformer = StatsTransformer.from_raster_sources(
            raster_sources=[raster_source],
            max_stds=3
        )

        ds = SemanticSegmentationSlidingWindowGeoDataset.from_uris(
            class_config=class_config,
            image_uri=img,
            size=chip_sz,
            stride=stride,
            image_raster_source_kw={
                'channel_order': [0, 1, 2],
                'raster_transformers': [stats_transformer]
            }
        )

        predictions = learner.predict_dataset(
            ds,
            raw_out=True,
            numpy_out=True,
            predict_kw=dict(out_shape=(chip_sz, chip_sz)),
            progress_bar=True
        )

        # Predict on image and save labels as vectors
        pred_labels = SemanticSegmentationLabels.from_predictions(
            ds.windows,
            predictions,
            smooth=True,
            extent=ds.scene.extent,
            num_classes=len(label_classes)
        )

        pred_labels.save(
            uri=output_uri,
            crs_transformer=ds.scene.raster_source.crs_transformer,
            class_config=class_config,
            vector_outputs=[
                PolygonVectorOutputConfig(
                    class_id=1, uri=vector_uri.format(fname, 1)
                ),
                PolygonVectorOutputConfig(
                    class_id=2, uri=vector_uri.format(fname, 2)
                ),
                PolygonVectorOutputConfig(
                    class_id=3, uri=vector_uri.format(fname, 3)
                ),
                PolygonVectorOutputConfig(
                    class_id=4, uri=vector_uri.format(fname, 4)
                )
            ]
        )

I tried this approach with the bundle found in the train folder. With this approach, I can successfully run the prediction. However, the prediction results are much poorer than the training results (as indicated by the validation images produced during training).

Environment

  • Installed via pip, run from the command line
  • Raster Vision version or commit: RV v0.20
  • OS (e.g., Linux): Linux
  • Python version: 3.9.5
  • CUDA 11.6.2
@AdeelH (Collaborator) commented Aug 31, 2023

Hi, thank you for the detailed report and sorry that you're having problems with RV.

Approach 1

I believe this error implies you are using the model bundle from the train/ folder instead of the one in the bundle/ folder, which is what you are supposed to use here.
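
For reference, a minimal sketch of the intended usage (the paths here are hypothetical):

    from rastervision.core.predictor import Predictor

    # Point at the zip under bundle/, not the one under train/
    predictor = Predictor('output/bundle/model-bundle.zip', '/tmp/rv')
    predictor.predict(['image.tif'], 'predictions/')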

Approach 2

I think I know what is going on here. You are probably specifying --channel-order before the arguments to the predict command, something like rastervision predict --channel-order 0 1 2 <model_bundle_uri> <image_uri> <out_uri>, and this is running into this hacky bit of code that parses the channel indices given to --channel-order.

The workaround is simply to specify --channel-order at the end of the command, like so:

rastervision predict \
https://s3.amazonaws.com/azavea-research-public-data/raster-vision/examples/model-zoo-0.20/isprs-potsdam-ss/model-bundle.zip \
https://s3.amazonaws.com/azavea-research-public-data/raster-vision/examples/model-zoo-0.20/isprs-potsdam-ss/sample-predictions/sample-img-isprs-potsdam-ss.tif \
prediction_output/ \
--channel-order 0 1 2

Approach 3

One reason for the worse results could be that you are using different stats to normalize the images with the StatsTransformer than the ones you used during training. You can try using the stats from analyze/ with your new data.
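
If it helps, here is a minimal sketch of reusing saved stats, assuming a stats JSON produced by the analyze stage (the exact path under analyze/ may differ in your run):

    from rastervision.core.data import RasterioSource, StatsTransformer

    stats_uri = 'output/analyze/stats/train_scenes/stats.json'  # hypothetical path
    stats_transformer = StatsTransformer.from_stats_json(stats_uri)
    raster_source = RasterioSource(
        'image.tif',  # hypothetical image
        raster_transformers=[stats_transformer])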

It is also possible that your new data is significantly different from your training set, in which case, you would see worse results if your model does not generalize well. So I would recommend predicting on images from your training/validation set as a sanity check.

@xholmes (Author) commented Sep 1, 2023

Hi @AdeelH,

Thank you for the prompt reply.

For approach 1, I tried using both bundles, the one from the bundle folder and the one from the train folder. Both give the same error.

Will give your suggestion on the order of the parameters a go for approach 2.

As for approach 3, I also thought of using the same validation image as a sanity check. The following is a comparison of the validation output during training vs the prediction output (same image, same model). Please ignore the difference in colour, as I did not adjust the colour display of the outputs; but note the lack of class predictions in the predicted output, visible as the large regions of black (background).

Output of validation (during training): [screenshot]

Output of prediction: [screenshot]

@xholmes (Author) commented Sep 1, 2023

I gave moving --channel-order 0 1 2 to the end of the command line a go. As you suggested, that solved the missing bundle error. However, I ran into the same missing attribute error:

AttributeError: 'LearnerPipelineConfig' object has no attribute 'dataset'

I was using the bundle in the bundle folder. I suspect it might be the same issue as the one faced in Approach 1.

Hi, thank you for the detailed report and sorry that you're having problems with RV.

Please, you don't have to apologise. I am thankful for your devotion and work on developing RV. Having developed frameworks before, I appreciate the dedication and hard work they require. Keep up the good work!

@AdeelH (Collaborator) commented Sep 1, 2023

I gave moving --channel-order 0 1 2 to the end of the command line a go. As you suggested, that solved the missing bundle error. However, I ran into the same missing attribute error:

AttributeError: 'LearnerPipelineConfig' object has no attribute 'dataset'

Interesting. That error is precisely what happens when you pass train/model-bundle.zip to the Predictor instead of bundle/model-bundle.zip. Can you confirm that the two bundles are not exactly the same? Is the pipeline-config.json in each of them a different size from the other? And does bundle/model-bundle.zip have another model-bundle.zip inside of it (i.e. inside the zip archive)?
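
A quick way to compare the two archives (a minimal sketch; the paths are hypothetical):

    from zipfile import ZipFile

    for uri in ['output/train/model-bundle.zip', 'output/bundle/model-bundle.zip']:
        with ZipFile(uri) as zf:
            print(uri)
            for info in zf.infolist():
                print(f'  {info.filename}: {info.file_size} bytes')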

I would recommend doing a fresh run (you can just run for 1 epoch to make it go faster) and then trying with the new bundle.

Also, are you able to successfully run the example predict command in my previous comment?


As for approach 3, can you try visualizing the predicted scores (like in the tutorial) to check whether they are messed up before they are saved to file?

@xholmes (Author) commented Sep 4, 2023

Interesting. That error is precisely what happens when you pass train/model-bundle.zip to the Predictor instead of bundle/model-bundle.zip. Can you confirm that the two bundles are not exactly the same? Is the pipeline-config.json in each of them a different size from the other? And does bundle/model-bundle.zip have another model-bundle.zip inside of it (i.e. inside the zip archive)?

Right! That might be where I mucked up. Upon checking the zip file in the bundle/ folder, it seems I had earlier unzipped the file. I have restored the original file and am rerunning it now. Will update when I get the results back.

But to answer your previous question, the following is the comparison when I unzip both bundles:

Content of bundle/model-bundle.zip: [screenshot]

Content of train/model-bundle.zip: [screenshot]


Also, are you able to successfully run the example predict command in my previous comment?

After changing the position of the --channel-order flag, the command worked. However, I ran into the same no attribute 'dataset' error. As you pointed out above, this is likely because I had previously unzipped the bundle/model-bundle.zip file while exploring it. I have re-run the job now and am waiting for the results. Will update when they come back.

@xholmes (Author) commented Sep 5, 2023

Hi @AdeelH,

So after following your suggestions, I have managed to get all 3 approaches running successfully!

For approach 1, where I was using the Predictor class with bundle/model-bundle.zip, the main issue was that I had previously unzipped the zip file in the bundle folder and forgotten about it. Restoring the original bundle file worked out well. The prediction outputs were also more in line with the performance of the trained model.

[screenshot]


With approach 2, where I was using rastervision predict on the command line, repositioning the --channel-order flag to the end worked like a charm. The rest of the problem was due to me using the unzipped bundle/model-bundle.zip. The prediction results were the same as the ones above (also because I was using the same model file).


Still waiting on the results from using the Learner class in approach 3; the cluster has been quite busy these few days. I removed the StatsTransformer from the pipeline, but I suspect it won't do much good, as the difference in the predictions is too large. Will update again when I get the results.

@xholmes (Author) commented Sep 5, 2023

I did as you suggested and checked the prediction results when using the Learner class. I used train/model-bundle.zip for reference. The prediction outcome still seems to vary significantly from the model's performance after training (see the post above for the expected outcome of the trained model).

Visual inspection of the prediction output saved as vectors: [screenshot]

Prediction scores before the output is saved: [image: ham_may_test1_scores]

@AdeelH (Collaborator) commented Sep 5, 2023

Awesome! Glad you were able to get the Predictor working.

If you have ruled out the StatsTransformer as the problem in approach 3, my next guess would be a discrepancy in either chip_sz or img_sz. Ideally, they should have the same values as were used for training. Note that for img_sz (which, by the way, is the size that chips are resized to before being passed to the model), you will need to explicitly pass a resize transform (transform=A.Resize(img_sz, img_sz)) to SemanticSegmentationSlidingWindowGeoDataset, as shown here: https://docs.rastervision.io/en/stable/usage/tutorials/pred_and_eval_ss.html#Get-scene-to-predict.
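
In the context of your approach-3 script, that would look something like this (a minimal sketch; img_sz is assumed to match the value used during training):

    import albumentations as A

    from rastervision.pytorch_learner import (
        SemanticSegmentationSlidingWindowGeoDataset)

    img_sz = 256  # hypothetical; use your training value
    ds = SemanticSegmentationSlidingWindowGeoDataset.from_uris(
        class_config=class_config,
        image_uri=img,
        size=chip_sz,
        stride=stride,
        transform=A.Resize(img_sz, img_sz))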

@xholmes (Author) commented Sep 5, 2023

Note that for img_sz (which, by the way, is the size that chips are resized to before being passed to the model)

I actually did not try to resize the chips, either during training or during prediction. Instead, my idea was to crop/chip the images into smaller chunks of size 64. I did this during training as follows:

    data_config = SemanticSegmentationGeoDataConfig(
        scene_dataset=scene_dataset,
        window_opts=GeoDataWindowConfig(method=GeoDataWindowMethod.sliding,
                                        size=chip_sz, stride=stride),
        img_channels=len(channel_order))
where chip_sz is 64 and stride is 1/4 of chip_sz (I wanted some overlap). However, revisiting my code (I wrote it quite some time back), I am wondering whether what I did was correct, as I noticed that I had also declared a SemanticSegmentationChipOptions instance

    chip_options = SemanticSegmentationChipOptions(
        window_method=SemanticSegmentationWindowMethod.sliding, stride=chip_sz
    )

and passed it to the config:

pipeline = SemanticSegmentationConfig(
        root_uri=output_uri,
        dataset=scene_dataset,
        backend=backend_config,
        train_chip_sz=chip_sz,
        predict_chip_sz=chip_sz,
        chip_options=chip_options
    )

Apologies, I digressed. Since I did not do any resizing, in my prediction script in approach 3 I specified the chip sizes (where chip_sz and stride have the same values as those used in training) as follows:

        ds = SemanticSegmentationSlidingWindowGeoDataset.from_uris(
            class_config=class_config,
            image_uri=img,
            size=chip_sz,
            stride=stride,
            image_raster_source_kw={
                'channel_order': [0, 1, 2]
            }
        )

        predictions = learner.predict_dataset(
            ds,
            raw_out=True,
            numpy_out=True,
            predict_kw=dict(out_shape=(chip_sz, chip_sz)),
            progress_bar=True
        )

As a sanity check, can you help verify whether what I did does what I intended?

@AdeelH (Collaborator) commented Sep 5, 2023

Ah. img_sz is a field in GeoDataConfig that is 256 by default; see the docs for SemanticSegmentationGeoDataConfig.img_sz. If you did not explicitly set it, 256 was the value used, so your model expects its inputs to be 256x256.
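
For example, it could be set explicitly in your training data config (a sketch based on your snippet above, reusing scene_dataset, chip_sz, stride and channel_order):

    data_config = SemanticSegmentationGeoDataConfig(
        scene_dataset=scene_dataset,
        window_opts=GeoDataWindowConfig(method=GeoDataWindowMethod.sliding,
                                        size=chip_sz, stride=stride),
        img_channels=len(channel_order),
        img_sz=256)  # explicit here; 256 is also the default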

If you add the resize transform in my previous comment, your approach should hopefully work better. A more automatic way to do this would be to do something like:

tf, _ = learner.cfg.data.get_data_transforms()
...
ds = SemanticSegmentationSlidingWindowGeoDataset.from_uris(
    ...,
    transform=tf)

DataConfig.get_data_transforms() returns a tuple of transforms: the first is a "base" transform that is applied to both the training and validation datasets, and the second is an augmentation transform that is applied only to the training dataset.

This is basically what Raster Vision does internally when you predict via the Predictor.

As a sanity check, can you help verify whether what I did does what I intended?

Your use of chip_sz and stride is correct.

@xholmes (Author) commented Sep 6, 2023

If you did not explicitly set it, 256 was the value used, so your model expects its inputs to be 256x256.

Omg! Now everything clicks. That probably explains why I see artifacts like the ones below in my predicted outputs (from using the Predictor).

[screenshot]

Ok, I will retrain my models but this time specifying img_sz. Thank you so much for the insights @AdeelH !

@AdeelH (Collaborator) commented Sep 6, 2023

I should hasten to add that using an img_sz > chip_sz is not a bad thing! Quite the opposite, in fact: it can, for example, make it easier for the model to see smaller objects and thus produce a finer segmentation.

To sum up: chip_sz controls how much content is read from the TIFF, and img_sz controls the size it is stretched to before being fed into the model.
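
As a quick illustration (a hedged sketch with hypothetical values):

    import albumentations as A
    import numpy as np

    chip_sz = 64   # a 64x64 window is read from the source TIFF
    img_sz = 256   # each chip is stretched to 256x256 before entering the model

    chip = np.zeros((chip_sz, chip_sz, 3), dtype=np.uint8)  # dummy chip
    resized = A.Resize(img_sz, img_sz)(image=chip)['image']
    print(resized.shape)  # (256, 256, 3)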

@xholmes (Author) commented Sep 7, 2023

If you add the resize transform in my previous comment, your approach should hopefully work better. A more automatic way to do this would be to do something like:

Tried this, and approach 3 using the Learner class works now. Setting smooth to True in SemanticSegmentationLabels also helped with the boundary artifacts (I think).

[screenshot]

Thank you @AdeelH for your help! Now everything is working. Quick question though, what is the difference between using SemanticSegmentationLabels.from_predictions with smooth=True and using SemanticSegmentationSmoothLabels.from_predictions?

@AdeelH (Collaborator) commented Sep 7, 2023

Thank you @AdeelH for your help! Now everything is working.

Wonderful!

Quick question though, what is the difference between using SemanticSegmentationLabels.from_predictions with smooth=True and using SemanticSegmentationSmoothLabels.from_predictions?

They are equivalent.

Setting smooth to True in SemanticSegmentationLabels also helped with the boundary artifacts (I think).

It should, yes. Using stride < chip_sz during prediction also means that (for non-corner pixels) the same pixel gets predicted multiple times (once for each chip it is a part of) and all of its predictions are averaged. However, it also makes prediction slower, so you should try to find a stride that strikes a good balance between speed and quality. An additional way to deal with boundary artifacts is the crop_sz argument; see an example here.
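
A minimal sketch of both knobs, reusing names from your approach-3 script (the crop_sz usage follows the linked example; the exact values are hypothetical):

    # Denser overlap: each pixel is covered by several chips whose
    # predictions get averaged.
    stride = chip_sz // 4

    # Discard a border of predictions around each chip before merging:
    pred_labels = SemanticSegmentationLabels.from_predictions(
        ds.windows,
        predictions,
        smooth=True,
        extent=ds.scene.extent,
        num_classes=len(label_classes),
        crop_sz=chip_sz // 8)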

@xholmes (Author) commented Sep 7, 2023

It should, yes. Using stride < chip_sz during prediction also means that (for non-corner pixels) the same pixel gets predicted multiple times (once for each chip it is a part of) and all of its predictions are averaged. However, it also makes prediction slower, so you should try to find a stride that strikes a good balance between speed and quality. An additional way to deal with boundary artifacts is the crop_sz argument; see an example here.

Yes, while searching for ways to tackle the boundary artifacts, I came across that post of yours. I will play around with the different parameters.

Once again, thank you for your help. Much appreciated. I'll close the issue since everything's working now.

xholmes closed this as completed Sep 7, 2023