Single Shot Object Detection

Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. The goal is to recognize instances of a predefined set of object classes: while classification is about predicting the label of the object present in an image, detection goes further and also finds the locations of those objects, so an object detection model is trained to detect both the presence and the location of multiple classes of objects. Well-researched domains include face detection and pedestrian detection; the camera application deployed in recent computers, for example, uses object detection to identify faces. Classic object detectors are based on a sliding window approach, in which object detection is modeled as a classification problem over densely sampled image patches.

This time, SSD (Single Shot Detector) is reviewed. (Sik-Ho Tsang @ Medium)

Single Shot object detection, or SSD, takes only one single shot to detect multiple objects within the image, while regional proposal network (RPN) based approaches, such as the R-CNN series, need two shots: one for generating region proposals and one for detecting the object of each proposal. This means that, in contrast to two-stage models, SSD does not need an initial object-proposal generation step. In essence, SSD is a multi-scale sliding window detector that leverages deep CNNs for both of these tasks. SSD300 achieves 74.3% mAP at 59 FPS while SSD512 achieves 76.9% mAP at 22 FPS, which outperforms Faster R-CNN (73.2% mAP at 7 FPS) and YOLOv1 (63.4% mAP at 45 FPS).

(A personal aside: I have recently spent a non-trivial amount of time building an SSD detector from scratch in TensorFlow. I had initially intended for it to help identify traffic lights in my team's SDCND Capstone Project. In the end, I managed to bring my implementation of SSD to a pretty decent state, and this post gathers my thoughts on the matter. It is not intended to be a tutorial.)
Single-shot MultiBox Detector is a one-stage object detection algorithm: a technique in computer vision used to identify and locate objects in an image or video. Single-shot detectors like YOLO and SSD take only one shot to detect multiple objects present in an image, using multiple ("multibox") prior boxes. Due to the advantages of real-time detection and improved performance, single-shot detectors have gained great attention recently. One-stage methods are widely used because of their high efficiency, but they are limited by their performance on small object detection: two common problems caused by object scale variations can be observed, namely (1) small objects are easily missed, and (2) the salient part of a large object is sometimes detected as an object. To handle complex scale variations, single-shot detectors make scale-aware predictions based on multiple pyramid layers: typically, small objects are detected on shallow layers while large objects are detected on deeper layers. This multi-scale design increases the robustness of the detection, so images where multiple objects of different scales/sizes are present at different locations can still be handled.

Architecture.
The SSD object detection network can be thought of as having two sub-networks: a feature extraction network followed by a detection network — in other words, a backbone model and an SSD head. The backbone is usually a pre-trained image classification network serving as the feature extractor, typically a pretrained CNN. For SSD, the base network is VGG16, pre-trained using the ILSVRC classification dataset, so SSD can be seen as a modification of the VGG16 architecture. For the object detection task, the original base extractor is extended to a larger network by removing and adding some successive layers: FC6 and FC7 are changed to convolution layers Conv6 and Conv7, as shown in the figure above, and pool5 is changed from 2×2-s2 to 3×3-s1. Furthermore, these layers use atrous convolution (a.k.a. the hole algorithm or dilated convolution) instead of conventional convolution: since the feature maps are still large at Conv6 and Conv7, atrous convolution increases the receptive field while keeping the number of parameters relatively small compared with conventional convolution. (I hope I can review DeepLab to cover this in more detail in the coming future.) With atrous, the result is about the same, but the variant without atrous is about 20% slower. The input image itself is of fairly low resolution (300×300 for SSD300, 512×512 for SSD512), which also keeps computation down. A sketch of this network surgery is shown below.
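As an illustration, here is a minimal sketch of that VGG16 surgery in PyTorch. The dilation value of 6 is an assumption borrowed from common public SSD implementations, not something stated above:

```python
import torch
import torch.nn as nn

# Minimal sketch of the SSD "VGG16 surgery" described above (PyTorch).
# dilation=6 is an assumption taken from common public SSD implementations.
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)   # was 2x2-s2 in VGG16
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)  # atrous "FC6"
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)               # "FC7" as a 1x1 conv

x = torch.randn(1, 512, 19, 19)   # conv5_3-sized feature map for a 300x300 input
y = torch.relu(conv7(torch.relu(conv6(pool5(x)))))
print(y.shape)                    # torch.Size([1, 1024, 19, 19])
```

With stride 1 in pool5 and dilation in Conv6, the 19×19 resolution is preserved while the receptive field grows.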
MultiBox Detector.
To have more accurate detection, different layers of feature maps also go through a small 3×3 convolution for object detection, as shown above. For illustration, we draw Conv4_3 as 8×8 spatially (it should really be 38×38). At each location, the detector predicts class scores and box offsets for a small set of default boxes. But the above is just one part of SSD: with more outputs taken from later conv layers, more bounding boxes are included. After going through a certain number of convolutions for feature extraction, we obtain (for SSD300):

- Conv4_3: 38×38×4 = 5776 boxes (4 boxes for each location)
- Conv7: 19×19×6 = 2166 boxes (6 boxes for each location)
- Conv8_2: 10×10×6 = 600 boxes (6 boxes for each location)
- Conv9_2: 5×5×6 = 150 boxes (6 boxes for each location)
- Conv10_2: 3×3×4 = 36 boxes (4 boxes for each location)
- Conv11_2: 1×1×4 = 4 boxes (4 boxes for each location)

Hence, SSD has 8732 bounding boxes in total, which is more than that of YOLO. The arithmetic is checked below.
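A quick sanity check of the box count in Python (the feature-map sizes and boxes-per-location are exactly those listed above):

```python
# Sanity check: total number of default boxes in SSD300.
feature_maps = {          # layer: (spatial size, boxes per location)
    "conv4_3":  (38, 4),
    "conv7":    (19, 6),
    "conv8_2":  (10, 6),
    "conv9_2":  (5, 6),
    "conv10_2": (3, 4),
    "conv11_2": (1, 4),
}
total = sum(size * size * boxes for size, boxes in feature_maps.values())
print(total)  # 8732
```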
Default Boxes: Scales and Aspect Ratios.
The scale of the default boxes grows linearly from the shallowest to the deepest detection layer: sk = smin + (smax − smin)·(k − 1)/(m − 1), for k in [1, m]. Smin is 0.2 and Smax is 0.9; that means the scale at the lowest layer is 0.2, the scale at the highest layer is 0.9, and all layers in between are regularly spaced. For each scale sk, we have 5 aspect ratios, ar ∈ {1, 2, 3, 1/2, 1/3}, giving boxes of width sk·√ar and height sk/√ar. For aspect ratio 1, an extra default box of scale s′k = √(sk·sk+1) is also added — the authors think that without it the boxes are not large enough to cover large objects. Therefore, we can have at most 6 bounding boxes in total with different aspect ratios; for layers with only 4 bounding boxes, ar = 1/3 and 3 are omitted. In ablation, using more default box shapes improves results from 71.6% to 74.3% mAP. A small generator for these boxes is sketched below.
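Here is a minimal sketch of that default-box recipe in plain Python (m = 6 detection layers, as in SSD300):

```python
import math

# Sketch of the default-box recipe: s_min = 0.2, s_max = 0.9, m detection layers.
def layer_scales(m, s_min=0.2, s_max=0.9):
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

scales = layer_scales(6)          # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
aspect_ratios = [1, 2, 3, 1 / 2, 1 / 3]

k = 0                             # first detection layer
boxes = [(scales[k] * math.sqrt(ar), scales[k] / math.sqrt(ar))
         for ar in aspect_ratios]                 # (width, height) pairs
extra = math.sqrt(scales[k] * scales[k + 1])      # extra box for ar = 1
boxes.append((extra, extra))
print(len(boxes))                 # 6 default boxes per location
```

For a layer with only 4 boxes, you would drop the ar = 3 and ar = 1/3 entries.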
Loss Function.
During training, ground-truth boxes are first matched to default boxes; the matched default boxes are the positives and the rest are negatives. (The figure above, quoted from SSD: Single Shot MultiBox Detector, illustrates this: (a) is the input image with the ground-truth box of each object, while the grid cells in (b) and (c) mark positions on two feature maps, each position carrying default boxes of different aspect ratios.) Rather than regressing box coordinates directly, the network predicts offsets relative to these matched default boxes; this can lead to faster optimization and a more stable training.

The training objective is a weighted sum of a confidence term and a localization term: L(x, c, l, g) = (1/N)·(Lconf(x, c) + α·Lloc(x, l, g)), where N is the number of matched default boxes. Lconf is the confidence loss, which is the softmax loss over multiple classes' confidences (c). Lloc is the localization loss, which is the smooth L1 loss between the predicted box (l) and the ground-truth box (g) parameters. This loss is similar to the one in Faster R-CNN. (α is set to 1 by cross validation.) A sketch of the loss is given below.
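A compact sketch of this objective in PyTorch; the tensor shapes and the pre-computed `pos` mask (the output of the matching step) are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

# Sketch of the SSD multibox loss for one image.
# cls_logits: (num_boxes, num_classes), loc_pred/loc_target: (num_boxes, 4),
# cls_target: (num_boxes,) with 0 = background, pos: boolean mask of matches.
def multibox_loss(cls_logits, loc_pred, cls_target, loc_target, pos, alpha=1.0):
    # Localization: smooth L1 over positive (matched) default boxes only.
    loc_loss = F.smooth_l1_loss(loc_pred[pos], loc_target[pos], reduction="sum")
    # Confidence: softmax cross-entropy over class scores. (In the full recipe,
    # negatives are subsampled by hard negative mining — see the next section.)
    conf_loss = F.cross_entropy(cls_logits, cls_target, reduction="sum")
    n = pos.sum().clamp(min=1).float()   # N = number of matched default boxes
    return (conf_loss + alpha * loc_loss) / n
```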
Hard Negative Mining.
After matching, the vast majority of default boxes are negatives, so single-shot methods like SSD suffer from extreme class imbalance. Instead of using all the negative examples, we sort them using the highest confidence loss for each default box and pick the top ones, so that the ratio between the negatives and positives is at most 3:1, as sketched below.
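One common way to implement this selection is the rank-sorting trick used by several public SSD codebases; this is a sketch under that assumption, not the reference implementation:

```python
import torch

# Hard negative mining: keep at most `neg_pos_ratio` negatives per positive,
# choosing the negatives with the highest per-box confidence loss.
def hard_negative_mask(conf_loss, pos, neg_pos_ratio=3):
    loss = conf_loss.clone()
    loss[pos] = 0.0                            # positives never compete as negatives
    _, order = loss.sort(descending=True)      # boxes ranked by confidence loss
    _, rank = order.sort()                     # rank of each box in that ordering
    num_neg = neg_pos_ratio * int(pos.sum())
    neg = rank < num_neg                       # top-ranked negatives survive
    return pos | neg                           # boxes kept for the confidence loss
```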
Data Augmentation.
To make the detector more robust to object sizes, each training image is randomly sampled: for example, sample a patch so that the (minimum Jaccard) overlap with the objects is 0.1, 0.3, 0.5, 0.7 or 0.9. After the above steps, each sampled patch will be resized to a fixed size and maybe horizontally flipped with probability of 0.5, in addition to some photometric distortions [14]. To overcome the weakness of missing detections on small objects, noted in the COCO results below, a "zoom out" operation is also done to create more small training samples, and an increase of 2%-3% mAP is achieved across multiple datasets with it. Overall, with data augmentation, the accuracy is improved from 62.4% to 74.6%. A sketch of the sampling pipeline is shown below.
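A sketch of that sampling strategy with NumPy; the Jaccard-constrained crop itself is left as a placeholder since the full search loop is lengthy:

```python
import random
import numpy as np

# Sketch of SSD-style augmentation: one sampling option is chosen per image.
# None keeps the whole image; a float is the minimum Jaccard overlap that
# the sampled patch must have with the ground-truth boxes.
SAMPLE_OPTIONS = [None, 0.1, 0.3, 0.5, 0.7, 0.9]

def augment(image, boxes):
    min_iou = random.choice(SAMPLE_OPTIONS)
    patch, patch_boxes = image, boxes    # placeholder: crop satisfying min_iou
    # ... resize the patch to a fixed size (e.g. 300x300) here ...
    if random.random() < 0.5:            # horizontal flip with probability 0.5
        patch = np.fliplr(patch)
    # photometric distortions (brightness, contrast, saturation, hue) follow
    return patch, patch_boxes
```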
Results.
On VOC2007, when the models are additionally trained with COCO data, SSD300 has 79.6% mAP, which is already better than Faster R-CNN at 78.8%; as shown above, SSD512 has 81.6% mAP. On VOC2012, SSD512 (80.0%) is 4.1% more accurate than Faster R-CNN (75.9%). On COCO, Faster R-CNN is more competitive than SSD on smaller objects: SSD512 has much better AP (4.8%) and AR (4.6%) for larger objects, but relatively less improvement in AP (1.3%) and AR (2.0%) for small objects.

As for a quick comparison between the speed and accuracy of different object detection methods: with batch size of 1, SSD300 and SSD512 can obtain 46 and 19 FPS respectively, and both have higher mAP and higher FPS than the two-shot alternatives. Thus, SSD is much faster compared with two-shot RPN-based approaches; the authors believe this is due to the RPN-based approaches consisting of two shots.
SSD with MobileNet.
The backbone feature extractor is interchangeable: from the MobileNet architecture, for instance, the last fully connected layers are removed so that the network can serve as a feature extractor for SSD. Below is a SSD example using MobileNet for feature extraction: running it on a desk scene detects the coffee, iPhone, notebook and laptop in the frame, and from this we can see the amazing real-time performance.
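A minimal inference sketch using OpenCV's DNN module. The file names (`MobileNetSSD_deploy.prototxt`, `MobileNetSSD_deploy.caffemodel`) refer to the widely mirrored Caffe MobileNet-SSD release and are assumptions — substitute your own paths; the 0.007843 scale and 127.5 mean follow that model's preprocessing:

```python
import cv2
import numpy as np

# Minimal MobileNet-SSD inference sketch using OpenCV's DNN module.
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")
image = cv2.imread("desk.jpg")          # hypothetical input image
h, w = image.shape[:2]
blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)),
                             0.007843, (300, 300), 127.5)
net.setInput(blob)
detections = net.forward()              # shape: (1, 1, N, 7)
for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > 0.5:
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        print(int(detections[0, 0, i, 1]), confidence, box.astype(int))
```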
Thus, SSD is one of the object detection approaches that needs to be studied.

Reference
[2016 ECCV] [SSD]
SSD: Single Shot MultiBox Detector

My Related Reviews
[R-CNN] [Fast R-CNN] [Faster R-CNN] [YOLOv1] [VGGNet]
